1. Introduction. First motivated by information theoretic considerations,
context tree models have been introduced by Rissanen in [37] as a
generalization of discrete Markov models. Since then, they have been widely
used in different areas of applied probability and statistics, from
Bioinformatics [6, 13] to Linguistics [22, 23]. Sometimes also called
Variable Length Markov Chain, a context tree source is a stochastic process
whose memory length may vary with the past: the probability distribution of
each symbol depends on a finite part of the past, the length of which is a
function of the past itself. Such a relevant part of the past is called a
context, and the set of all contexts can be represented as a labeled tree
called the context tree of the process.

Rissanen provided in his seminal paper a pruning algorithm called Context
for identifying the tree of a process, given a sample $X_1, \ldots, X_n$. He
proved his estimator Context to be weakly consistent when the tree of
contexts is finite; this result was later completed by a series of papers,
including [36], which removed the need for a known bound on the maximal
length of the memory. On the other hand, penalized maximum likelihood
criteria were proved to be strongly consistent in [18, 24]. More recently,
several efforts have been made to obtain non-asymptotic bounds on the
probability of correct estimation (see [27] and references therein).

But the problem of estimation is not the only problem of interest con- 
cerning context trees. In fact, these models are widely used because of the 
remarkable tradeoff they offer between expressivity and simplicity: by
providing memory only where necessary, they form a very rich and powerful
family of simple processes for the approximation of arbitrary sources. In
coding theory, for instance, they are the keystone of the universal coder termed
Context Tree Weighting (CTW) (see [44, 14]). The idea behind CTW is that 
a double mixture, over all trees (with a given maximal depth) and, within 
each tree model, over all parameters, can be computed efficiently. Using this 
double mixture as a predictive coding distribution leads to a coder that is 
proved to satisfy an oracle inequality with respect to the natural loss of 
Information Theory. 

The aim of this paper is to show that model selection, and not only
aggregation, can be used in an oracle approach for the problem of context
tree estimation with Kullback loss. For every finite context tree $\tau$
(see Section 2 for details), we can estimate the transition probabilities
of the source $P$ by those $\hat P_\tau$ associated to $\tau$ of the
empirical measure. The oracle approach consists in looking for the tree
minimizing the Kullback risk of the estimators $\hat P_\tau$. This choice
of the loss function, while causing a few technical difficulties, emerges
naturally from an information theoretic point of view. Following the
terminology of [35], the Kullback risk appears as the excess risk
associated to the logarithmic loss, which is an (idealized) codelength in
coding theory. Hence, the Kullback risk appears as a redundancy term caused
by the fact that the coder does not know in hindsight which source is to be
coded.

When the source has a finite context tree $\tau_0$, the oracle approach
asymptotically coincides with the consistency approach, because the tree
that has the smallest risk is the minimal tree of the source for large
numbers of observations. This is no longer the case when the true tree is
infinite or at least large compared to the number of data. Then, the
Kullback loss of $\hat P_\tau$ is decomposed into a bias term measuring the
approximation properties of $\tau$ and a variance term measuring the error
of estimation. Identification procedures look for the minimal tree with no
bias, whereas oracle procedures look for a tree balancing bias and variance.

The identification approach is inspired by the classical asymptotic
situation where the bias term, when non null, is very large compared to the
variance term. In this case, under-estimation is easily avoided and the
procedures mainly focus on avoiding over-estimation, see for example
[18, 23, 27]. On the other hand, the oracle approach is inspired by non
asymptotic situations where the true tree is large (compared to the number
of data) and can even be infinite. In particular, there exist trees with a
bias much smaller than the variance: an oracle is typically a small subtree
of $\tau_0$ realizing a good tradeoff between the bias (which decreases with
the tree size) and the variance (which, in turn, increases with the tree
size). This modern approach is more natural to tackle realistic situations
with a reasonable number of observations; namely, the set of context trees
is used as a toolbox and we want to select the tool that is best suited, in
terms of Kullback loss.

The oracle point of view comes from statistical learning theory where it
is now well understood in classical problems of non parametric statistics
such as regression or density estimation (see [34] and the references
therein for an introduction). A classical method of selection consists in
choosing the model minimizing an empirical loss plus some penalty
proportional to the complexity of the model. This principle is the one used
in [4, 8, 9, 34]. Another famous method consists in aggregating a finite set
of functions, i.e. in choosing a linear combination of previous estimators
or approximating functions. An important example of such a procedure is the
Lasso, where the aggregating weights are chosen by minimization of an
$\ell_1$-penalized criterion, see [7, 15, 20, 28, 36, 41, 45, 46, 47].
Complexity penalization procedures are theoretically more interesting
because they cover in the same framework several general problems, whereas
$\ell_1$-penalties are preferred in linear problems for their computational
efficiency. We propose a penalization procedure here and we verify that the
estimator can be efficiently computed.

Penalized log-likelihood estimators have been studied in context tree esti- 
mation, for example by [18]. These authors proved that BIC-like estimators 
(see Section 3.3) are asymptotically consistent when the source has a fi- 
nite context tree, whatever the leading constant in the BIC-like penalty. 
Moreover, they showed that BIC estimators can be computed efficiently in 
practice. However, much less is known about the risk of the selected esti- 
mator, when the actual context tree is infinite. In addition, the question of 
the choice of the leading constant in the BIC penalty for finite number of 
data remains open. Actually, [23] proved that, for a fixed number of data, 
the set of trees selected by BIC-like penalties for varying leading constants 
is exactly the set of champions, where the champion of size k is the tree 
maximizing the log-likelihood among the trees with less than k degrees of 
freedom. 

Our first goal in this paper is to present a general method to obtain oracle
inequalities for a selected $\hat\tau$, that is, an inequality between the
Kullback loss of $\hat P_{\hat\tau}$ and the minimum of the Kullback losses
of the $\hat P_\tau$. We emphasize the central role of concentration
inequalities to develop these results for context tree selection, which
makes a clear link with model selection theory, as presented, for example,
in [4, 8, 9] and many others after them. Actually, all the general theorems
are consequences of a concentration condition, that we verify for mixing
processes. We obtain then a class of examples where the general results
apply. For these processes, our penalty takes a BIC form, with a
sufficiently large leading constant. As a corollary, we prove therefore
that BIC-like estimators have oracle properties when the data are
sufficiently mixing. From a theoretical point of view, the difficulty comes
from the fact that new concentration inequalities are required for words
that are not contexts, which prevents us from using the martingale approach
of [17, 27].

We study also the slope heuristic of [10] in context tree estimation. The
heuristic, presented more formally in Section 3.4, states the existence of a
minimal penalty $\mathrm{pen}_{\min}$ under which the selected tree has huge
complexity and over which this complexity is much smaller. Moreover, it
states that $2\,\mathrm{pen}_{\min}$ is an optimal penalty, i.e. that the
selected estimator satisfies an asymptotically optimal oracle inequality.
The reasons of this phenomenon rely on a fine analysis of the ideal penalty,
see [1, 3]. The ideal penalty is the sum of two terms and the slope
heuristic essentially holds when these two terms are asymptotically equal,
see [3]. The heuristic does not hold in general as was proved in linear
regression by [2]. In that case, [2] proved that an optimal penalty is given
by $C\,\mathrm{pen}_{\min}$ for a constant $C$, different from 2.

We study the standard slope heuristic, with an optimal penalty equal to
$2\,\mathrm{pen}_{\min}$, under our concentration assumption and make
therefore a contribution to this growing area of statistical learning
[3, 10, 31, 32]. Note that few proofs are available for non-Hilbertian
risks [33, 38], and, to our knowledge, our results are the first ones in a
discrete, non i.i.d. framework. The heuristic is important since it
underlies the slope algorithm presented in [3] to calibrate leading
constants in the penalties. In the mixing case, the algorithm provides an
answer to the question of practical calibration of the leading constant in
the BIC penalty.

We present a simulation study to illustrate our results. The simulations 
are conducted in the particular family of renewal sources (see [40, 25]) for 
which bias and variance terms can be computed easily, which is not the case 
in general. The simulations show that for relatively small sample sizes of 
finite sources, the BIC estimator, while failing to recover the true model, 
does satisfy nice oracle properties; the slope algorithm improves slightly on 
that, for a very moderately increased computational cost. 

The paper is organized as follows. Section 2 presents some notation used
throughout the paper. Section 3 presents our general results. In
particular, we show how to deduce from concentration inequalities 1) good
penalties yielding oracle properties of the selected estimators and 2)
theoretical evidence for the slope heuristic. Section 4 presents an
application of our general approach to mixing processes. We show that they
satisfy good concentration properties and we deduce oracle properties of
the BIC estimators in this case. Section 5 presents our simulation study
and the proofs are postponed to the appendix.

2. Notation. We use the conventions $0/0 = +\infty$ and
$0 \ln(+\infty) = 0$. For all $a \in \mathbb{R}$, $\lceil a \rceil$ denotes
the smallest integer larger than or equal to $a$ and $\lfloor a \rfloor$
the largest integer smaller than or equal to $a$. Given two sequences
$(u_n)$ and $(v_n)$, we use the notation $u_n = O(v_n)$ and $u_n = o(v_n)$
when there exists a constant $C$ such that $|u_n| \le C|v_n|$,
respectively, when there exists a sequence $\epsilon_n \to 0$ such that
$|u_n| \le \epsilon_n |v_n|$. All the random variables are defined on a
probability space $(\Omega, \mathcal{X}, \mathbb{P})$ and we denote by
$\mathbb{E}$ the expectation with respect to $\mathbb{P}$.

Let $A$ be a finite set, with cardinality $|A|$, and, for all $x > 0$, let
$\log(x) := \ln(x)/\ln(|A|)$ be the logarithm in base $|A|$. For all $n$ in
$\mathbb{N}^*$, let $A^{(n)} := \bigcup_{k=1,\ldots,n} A^k$ and let
$A^* := \bigcup_{k \in \mathbb{N}^*} A^k$. $A$ is called an alphabet and the
elements of $A^*$ are called words. For all integers $m$ and $n$ such that
$m \le n$, for all words $(a_m, \ldots, a_n) \in A^{n-m+1}$, we denote by

    $a_m^n := (a_m, \ldots, a_n)$ and $|a_m^n| := n - m + 1$ .

The notation is extended to semi-infinite sequences where $m = -\infty$, in
that case $a_{-\infty}^n := (a_i)_{i \le n}$ and, by definition, for all
$n \in \mathbb{Z}$, $|a_{-\infty}^n| := \infty$. The space of semi-infinite
sequences is denoted by $A^{-\mathbb{N}}$ and we define
$A^{(-\infty)} := A^{-\mathbb{N}} \cup A^*$. For every
$(\omega, \omega') \in A^{(-\infty)} \times A^*$, let $\omega\omega'$ denote
the concatenation of $\omega$ and $\omega'$.

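Definition 1. A context tree is a subset $\tau \subset A^{(-\infty)}$ such
that, for every semi-infinite sequence $\omega = a_{-\infty}^{-1}$, there
exists a unique $\omega_\tau \in \tau$ such that

    $a_{-|\omega_\tau|}^{-1} = \omega_\tau$ .

The set of context trees is denoted by $\mathcal{T}$. For every
$\tau \in \mathcal{T}$, let $d(\tau) := \max\{|\omega|, \omega \in \tau\}$.
For every integer $k \ge 1$, let
$\mathcal{T}_k := \{\tau \in \mathcal{T} : d(\tau) \le k\}$. When
$d(\tau) < \infty$, we say that $\tau$ is finite. For every finite tree
$\tau$, let $N(\tau)$ denote the number of elements of $\tau$.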
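
To make Definition 1 concrete, here is a minimal Python sketch, assuming a finite tree stored as a set of strings in which each word is written in time order (most recent symbol last). The names `context_of`, `depth` and `size`, and the toy tree over $\{0,1\}$, are illustrative assumptions and do not come from the paper.

```python
def context_of(tree, past):
    """Return the unique word of `tree` that is a suffix of `past`.

    `past` is a finite string standing in for a semi-infinite past; it must
    be at least as long as the deepest word of the tree.
    """
    matches = [w for w in tree if past.endswith(w)]
    if len(matches) != 1:
        raise ValueError("not a context tree for this past: %r" % (matches,))
    return matches[0]


def depth(tree):
    """d(tau): length of the longest context."""
    return max(len(w) for w in tree)


def size(tree):
    """N(tau): number of contexts."""
    return len(tree)


# Toy binary tree (an assumption for illustration): a past ending in 1 has
# context "1"; a past ending in 0 needs one more symbol, "10" or "00".
toy_tree = {"1", "10", "00"}
toy_context = context_of(toy_tree, "0110")
```

On this toy tree the memory length is 1 after a 1 and 2 after a 0, which is the variable-length behaviour described in the introduction.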

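The set $\mathcal{T}$ is provided with the following partial order relation

    $\tau \preceq \tau'$ iff $\forall \omega' \in \tau'$, $\exists \omega \in \tau$ such that $\omega$ is a suffix of $\omega'$ .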

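In the sequel, we will make repeated use of the following abuse of notation.
When $\mathcal{A}$ is a set of trees, we will write $\omega \in \mathcal{A}$
instead of $\exists \tau \in \mathcal{A} : \omega \in \tau$. We will in
particular use repeatedly the notation $\forall \omega \in \mathcal{A}$
instead of $\forall \tau \in \mathcal{A}, \forall \omega \in \tau$.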

Definition 2. A transition kernel is a function

    $P : A^{-\mathbb{N}} \times A \to [0,1]$, $(\omega, a) \mapsto P(a|\omega)$

such that, for every $\omega \in A^{-\mathbb{N}}$,
$\sum_{a \in A} P(a|\omega) = 1$.

A chain $(X_k)_{k \in \mathbb{Z}}$ is a stationary ergodic stochastic
process on $A^{\mathbb{Z}}$.

A chain $(X_k)_{k \in \mathbb{Z}}$ with distribution $\mu$ on
$A^{\mathbb{Z}}$ is said to be compatible with a transition kernel $P$ if
the latter is a regular version of the conditional probabilities of the
former:

    $\forall (\omega, a) \in A^{-\mathbb{N}} \times A$, $\mu(X_0 = a \mid X_{-\infty}^{-1} = \omega) = P(a|\omega)$ .

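For every chain $(X_k)_{k \in \mathbb{Z}}$, with distribution $\mu$
compatible with a transition kernel $P$, for every context tree $\tau$, we
denote by $P_\tau$ a regular version of the following conditional
probability:

    $\forall (\omega, a) \in \tau \times A$, $P_\tau(a|\omega) := \mu(X_0 = a \mid X_{-|\omega|}^{-1} = \omega)$ .

For all finite context trees $\tau$, let $\bar P_\tau$ be the transition
kernel defined by

    $\forall (\omega_1, \omega, a) \in A^{-\mathbb{N}} \times \tau \times A$, $\bar P_\tau(a|\omega_1\omega) := P_\tau(a|\omega)$ ,

and for all $(\omega, a) \in A^* \times A$, let $\mu_\tau$ be the
probability measure defined recursively by

    $\mu_\tau(\omega a) := \mu_\tau(\omega) P_\tau(a|\omega_1)$ if $\exists (\omega_1, \omega_2) \in \tau \times A^* : \omega = \omega_2\omega_1$ ,
    $\mu_\tau(\omega a) := \mu(\omega a)$ else.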



Definition 3. Let $\mathcal{M}$ be the set of all stationary, ergodic,
probability measures on $A^{\mathbb{Z}}$. For all finite context trees
$\tau$, let $\mathcal{M}_\tau := \{\mu \in \mathcal{M} : \mu = \mu_\tau\}$.
For every $\mu \in \mathcal{M}_\tau$, $(\tau, P_\tau)$ is called a
probabilistic context tree and $\mu$ is called a probabilistic context tree
source with tree $\tau$.

For all transition kernels $Q$, if
$\mu\{Q(a|\omega) = 0 \Rightarrow P(a|\omega) = 0\} = 1$, we define

    $K_\mu(P, Q) := \int d\mu(\omega) \sum_{a \in A} P(a|\omega) \ln\left(\frac{P(a|\omega)}{Q(a|\omega)}\right)$ .

We take the convention that, if
$\mu\{Q(a|\omega) = 0 \Rightarrow P(a|\omega) = 0\} < 1$, then
$K_\mu(P, Q) := +\infty$. For any finite $\tau$, for any probability measure
$\mu$ on $A^{\mathbb{Z}}$ compatible with a transition kernel $P$ and for
any family of transition probabilities $(Q(\cdot|\omega))_{\omega \in \tau}$,
we also define

wSt a&A ^ ^ 



The observation set is defined by the projection $X_1^n$ of a chain
$(X_k)_{k \in \mathbb{Z}}$ with distribution $\mu$ compatible with a
transition kernel $P$. Our goal is to estimate $P$ from $X_1^n$. The risk of
the estimators will be measured with the Kullback loss $K_\mu$. For all
$t \le n$ and all $\omega$ in $A^{(t)}$, we define

    $\hat\mu_t(\omega) := \frac{1}{n - |\omega| + 1} \sum_{k=|\omega|}^{t} 1_{X_{k-|\omega|+1}^{k} = \omega}$ .

A word $\omega$ such that $\hat\mu_{n-1}(\omega) > 0$ is called feasible, a
tree $\tau$ such that every word is feasible is also called feasible and the
set of feasible trees is denoted by $\mathcal{F}$. We also denote, for all
$k \le n$, by $\mathcal{F}_k := \mathcal{T}_k \cap \mathcal{F}$.

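For all $\tau \in \mathcal{F}$, denote by $\hat P_\tau$ and
$\bar{\hat P}_\tau$ the following functions:

    $\forall (\omega, \omega', a) \in \tau \times A^{-\mathbb{N}} \times A$, $\hat P_\tau(a|\omega) = \frac{\hat\mu_n(\omega a)}{\hat\mu_{n-1}(\omega)}$ , $\bar{\hat P}_\tau(a|\omega'\omega) = \hat P_\tau(a|\omega)$ .

Note that, for all $t \le n - 1$ and $(\omega, a) \in A^{(t)} \times A$,
$\sum_{a \in A} \hat\mu_{t+1}(\omega a) = \hat\mu_t(\omega)$.

Hence, for all feasible trees $\tau$, $\bar{\hat P}_\tau$ defines a
transition kernel estimating $P$.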
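
The estimator above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sample is a string and a tree is a collection of strings written in time order; it works directly with occurrence counts, so the normalizing constants of the empirical measure are deliberately left out. The helper names and the toy sample are hypothetical.

```python
def count_word(x, w, t):
    """Count indices k in {|w|, ..., t} (1-indexed) with x_{k-|w|+1}^k == w."""
    m = len(w)
    return sum(1 for k in range(m, t + 1) if x[k - m:k] == w)


def is_feasible(x, tree):
    """A tree is feasible when every context occurs before the last symbol."""
    n = len(x)
    return all(count_word(x, w, n - 1) > 0 for w in tree)


def transition_estimates(x, tree, alphabet):
    """Return {(w, a): estimated probability of a after context w}."""
    n = len(x)
    if not is_feasible(x, tree):
        raise ValueError("tree is not feasible for this sample")
    estimates = {}
    for w in tree:
        n_w = count_word(x, w, n - 1)  # occurrences of w followed by a symbol
        for a in alphabet:
            estimates[(w, a)] = count_word(x, w + a, n) / n_w
    return estimates


# Toy sample and tree (assumptions for illustration only).
sample = "0110100110"
toy_tree = ("1", "10", "00")
p_hat = transition_estimates(sample, toy_tree, "01")
```

Because every occurrence of a context before the last position is followed by exactly one symbol, the estimates attached to a given context sum to one, which is the property the note above relies on.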
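Our goal in this paper is to select a tree $\hat\tau \in \mathcal{F}$ such
that, given a confidence level $\delta \in (0, 1)$,

(1)    $\mathbb{P}\left( K_\mu(P, \hat P_{\hat\tau}) \le \inf_{\tau \in \mathcal{F}'} \left[ C K_\mu(P, \hat P_\tau) + R(\tau, \delta) \right] \right) \ge 1 - \delta$ .

In the previous inequality, the constant $C$ is expected to be close to 1,
the subset $\mathcal{F}' \subset \mathcal{F}$ is supposed to be large and
the remainder term $R(\tau, \delta)$ should not be too large. In that case,
we say that $\hat\tau$ satisfies an oracle inequality. Let us mention here
that, for every $\tau \in \mathcal{F}$, we have, see Lemma 24,

(2)    $K_\mu(P, \hat P_\tau) = K_\mu(P, \bar P_\tau) + K_\mu(P_\tau, \hat P_\tau)$ .

In (2), $K_\mu(P, \bar P_\tau)$ is called the bias term and
$K_\mu(P_\tau, \hat P_\tau)$ is called the variance term of the risk. An
alternative to (1) is the following

(3)    $\mathbb{P}\left( K_\mu(P, \hat P_{\hat\tau}) \le \inf_{\tau \in \mathcal{F}'} \left[ C K_\mu(P, \bar P_\tau) + R(\tau, \delta) \right] \right) \ge 1 - \delta$ .
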
3. General approach. 

3.1. Assumptions. Let us recall the definition of typicality (see [18, 17]
for example).

Definition 4. For every $\eta \in (0, 1)$ and $k \le n$, a word $\omega$ is
called $(k, \eta)$-typical if

    $(1 - \eta)\mu(\omega) \le \hat\mu_k(\omega) \le (1 + \eta)\mu(\omega)$ .

The set of $(k, \eta)$-typical words is denoted by $T(k, \eta)$ and let

(4)    $T_{typ}(\eta) := \{ \omega : \forall a \in A, \omega \in T(n-1, \eta) \text{ and } \omega a \in T(n, \eta) \}$ .



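Concentration is the central tool to develop model selection theory, as
shown in the series of works [4, 8, 9] and many authors after them. In
this section, we assume the following concentration condition. There exist
$\varrho_n \to 0$, $\varphi_n \to 0$, $\{d_n \le n - 1, d_n \to \infty\}$,
$p_n \to 0$, and an event $\Omega_{conc}$ satisfying
$\mathbb{P}(\Omega_{conc}^c) \le \varphi_n$, such that

(CC)    $\forall \delta \in (0,1)$, $\forall t \in \{n-1, n\}$, $\forall \tau \in \mathcal{T}_{d_n}$, $\forall \omega \in \tau \cup \tau \times A$,

    $\mathbb{P}\left( |\hat\mu_t(\omega) - \mu(\omega)| \le \sqrt{p_n \mu(\omega) \ln\left(\frac{1}{\delta}\right)} + \varrho_n \ln\left(\frac{1}{\delta}\right) \;\cap\; \Omega_{conc} \right) \ge 1 - \delta$ .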



Let us now choose dn as in assumption (CC) and let tt be a probability mea- 
sure on such that, for all io G A'^", '^a&A'^i^'^) ~ '^{^)- Assumption 
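(CC) and a union bound ensure that
$\mathbb{P}\{\Omega_{good}\} \ge 1 - 2\delta - \varphi_n$, where

    $\Omega_{good} := \Big\{ \forall (\tau, a) \in \mathcal{T}_{d_n} \times A, \forall \omega \in \tau,$

    $|\hat\mu_{n-1}(\omega) - \mu(\omega)| \le \sqrt{p_n \mu(\omega) \ln\left(\frac{1}{\pi(\omega)\delta}\right)} + \varrho_n \ln\left(\frac{1}{\pi(\omega)\delta}\right)$ ,

(5)    $|\hat\mu_n(\omega a) - \mu(\omega a)| \le \sqrt{p_n \mu(\omega a) \ln\left(\frac{1}{\pi(\omega a)\delta}\right)} + \varrho_n \ln\left(\frac{1}{\pi(\omega a)\delta}\right) \Big\}$ .
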

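Let $\Lambda_n \to \infty$, $\Lambda'_n \to \infty$, let
$\delta \in (0, 1)$, let

    $r_n(\pi, \delta, \omega, a) = \ln\left(\frac{1}{\pi(\omega a)\delta}\right) \left( \Lambda_n p_n \vee \Lambda'_n \varrho_n \right)$ ,

and let

(6)    $\hat{\mathcal{F}}^{(n)}(\delta) := \{ \tau \in \mathcal{F}_{d_n} : \forall \omega \in \tau, \forall a \in A, \hat\mu_n(\omega a) = 0 \text{ or } \hat\mu_n(\omega a) \ge 2 r_n(\pi, \delta, \omega, a) \}$ .

(7)    $\mathcal{F}_*(\delta) := \{ \tau \in \mathcal{T}_{d_n} : \forall \omega \in \tau, \forall a \in A, \mu(\omega a) = 0 \text{ or } \mu(\omega a) \ge r_n(\pi, \delta, \omega, a) \}$ .

(8)    $\mathcal{F}_*^{(2)}(\delta) := \{ \tau \in \mathcal{T}_{d_n} : \forall \omega \in \tau, \forall a \in A, \mu(\omega a) = 0 \text{ or } \mu(\omega a) \ge 4 r_n(\pi, \delta, \omega, a) \}$ .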

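3.2. A typicality result. Our first result is that assumption (CC) implies
typicality of the words in
$\hat{\mathcal{F}}^{(n)}(\delta) \cup \mathcal{F}_*(\delta)$. More
precisely, the following proposition holds.

Proposition 5. Let $\delta > 0$ and let $\hat{\mathcal{F}}^{(n)}(\delta)$,
$\mathcal{F}_*(\delta)$ and $\mathcal{F}_*^{(2)}(\delta)$ be the sets
defined in (6), (7) and (8) respectively. Let $T_{typ}(\eta)$ be the set
defined in (4) and let $\Omega_{good}$ be the event (5). There exists $n_0$
such that, for all $n \ge n_0$, on $\Omega_{good}$,
$\mathcal{F}_*^{(2)}(\delta) \subset \hat{\mathcal{F}}^{(n)}(\delta) \subset \mathcal{F}_*(\delta)$.
Moreover, there exists
$\eta = O\left( \frac{1}{\sqrt{\Lambda_n}} \vee \frac{1}{\Lambda'_n} \right)$
such that, on $\Omega_{good}$, all the words in $\mathcal{F}_*(\delta)$
belong to $T_{typ}(\eta)$.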

Remark 1. Hereafter, we work on the event $\Omega_{good}$ that has "large
probability", i.e. larger than $1 - 2\delta - \varphi_n$. The collection of
trees that we are interested in is $\hat{\mathcal{F}}^{(n)}(\delta)$.
Proposition 5 states that, on $\Omega_{good}$, this collection is "large"
since it contains the collection $\mathcal{F}_*^{(2)}(\delta)$ of words with
sufficiently large probability of occurrence and the words in
$\hat{\mathcal{F}}^{(n)}(\delta)$ are typical since they belong to
$\mathcal{F}_*(\delta)$.

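3.3. Model selection. The purpose of this section is to study penalized
log-likelihood estimators defined in general as follows. Let
$\mathrm{pen} : \mathcal{T} \to \mathbb{R}_+$ and let

(9)    $\hat\tau := \arg\min_{\tau \in \hat{\mathcal{F}}^{(n)}(\delta)} \left\{ \sum_{\omega \in \tau} \hat\mu_{n-1}(\omega) \sum_{a \in A} \hat P_\tau(a|\omega) \ln\left( \frac{1}{\hat P_\tau(a|\omega)} \right) + \mathrm{pen}(\tau) \right\}$ .

A particular case of such estimators is given by the family of penalties

    $\mathrm{pen}_c(\tau) = c\, |A|\, N(\tau) \frac{\ln n}{n}$ .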






M. LERASLE AND A. GARIVIER 



This is, up to a constant, the penalty term suggested by the BIC criterion
of [39]: thus, in the following, this penalty will be termed "BIC-like" or
will even, with some abuse, be called a BIC penalty. The corresponding
estimators have been studied in a series of papers initiated by [18] in
context tree estimation. It is proved in [18] that BIC estimators are
consistent when there exists a finite tree $\tau$ such that
$\mu = \mu_\tau$. We are interested here in oracle properties of the
selected estimator, that is, we want to compare
$K_\mu(P, \hat P_{\hat\tau})$ with $\inf_\tau K_\mu(P, \hat P_\tau)$. The
following theorem is the main result of the paper.

Theorem 6. Let $(X_n)_{n \in \mathbb{Z}}$ be a stationary ergodic process
satisfying assumption (CC). Let $\delta > 0$ and let
$\hat{\mathcal{F}}^{(n)}(\delta)$ be the set defined in (6). Let
$\Omega_{good}$ be the event (5). Let

    $p_{\min} = \min_{(\omega, a) \in \hat{\mathcal{F}}^{(n)}(\delta) \times A : P(a|\omega) \ne 0} P(a|\omega)$ .

Let $L > 6 + 18 p_{\min}^{-1}$ and let $\hat\tau$ be the penalized estimator
defined in (9), with
(10) 



Vr G pen(T) >L ^ + 




1 



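There exist $n_0$ and a constant $C_*$ such that, for all $n \ge n_0$, on
$\Omega_{good}$, we have

    $\forall \tau \in \hat{\mathcal{F}}^{(n)}(\delta)$, $C_* K_\mu(P, \hat P_{\hat\tau}) \le K_\mu(P, \hat P_\tau) + \mathrm{pen}(\tau)$ .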

Remark 2. The condition $L > 6 + 18 p_{\min}^{-1}$ in Theorem 6 can be
replaced by $L > 6 + 36 p_{\min}^{-1}$, where

    $p_{\min} = \min_{(\omega, a) \in \mathcal{F}_*(\delta) \times A : P(a|\omega) \ne 0} P(a|\omega)$ ,

by using the typicality Proposition 5.

Remark 3. Theorem 6 reduces the problem of model selection to the proof of
a concentration inequality of the type (CC). We will show in Section 4 that
such concentration inequalities are available when
$(X_n)_{n \in \mathbb{Z}}$ is geometrically $\phi$-mixing. In that case, we
can take $d_n = O(\ln n)$ and $p_n = O(n^{-1})$,
$\varrho_n = O(n^{-1} \ln n)$. Therefore, choosing for $\pi$ the uniform
probability measure on $A^{(d_n+1)}$, the condition (10) holds for BIC
penalties $\mathrm{pen}_c(\tau)$ if $c$ is large enough. However, the
concentration in that case involves some unknown constant in $p_n$.
Moreover, the constant $L > 3 + 6 p_{\min}^{-1}$ proposed is too large for
practical use. In order to overcome this problem, we propose to study in
Section 3.4 the slope algorithm of [10].
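
As an illustration of penalized selection with a BIC-like penalty, the following Python sketch scores a short list of candidate trees and keeps the minimizer. It is a sketch under stated assumptions, not the paper's procedure: the log-likelihood is computed from raw counts and divided by the sample size, the candidate list is given by hand rather than searched efficiently, and the names `neg_log_likelihood`, `bic_like_penalty` and `select_tree` are hypothetical.

```python
import math


def count_word(x, w, t):
    """Count indices k in {|w|, ..., t} (1-indexed) with x_{k-|w|+1}^k == w."""
    m = len(w)
    return sum(1 for k in range(m, t + 1) if x[k - m:k] == w)


def neg_log_likelihood(x, tree, alphabet):
    """Normalized negative log-likelihood of the tree model fitted on x."""
    n = len(x)
    total = 0.0
    for w in tree:
        n_w = count_word(x, w, n - 1)
        for a in alphabet:
            n_wa = count_word(x, w + a, n)
            if n_wa > 0:  # convention 0 ln 0 = 0
                total -= n_wa * math.log(n_wa / n_w)
    return total / n


def bic_like_penalty(tree, alphabet, n, c):
    """BIC-like shape: c |A| N(tau) ln(n) / n."""
    return c * len(alphabet) * len(tree) * math.log(n) / n


def select_tree(x, candidates, alphabet, c):
    """Return the candidate minimizing empirical loss plus penalty."""
    n = len(x)

    def criterion(tree):
        return (neg_log_likelihood(x, tree, alphabet)
                + bic_like_penalty(tree, alphabet, n, c))

    return min(candidates, key=criterion)


# Toy data (assumptions for illustration only).
sample = "0110100110"
candidates = [("0", "1"), ("1", "10", "00")]
chosen = select_tree(sample, candidates, "01", c=100.0)
```

With a very large leading constant the penalty dominates and the smallest candidate is returned, which illustrates why the choice of the constant matters for finite samples.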

3.4. Slope algorithm. The slope algorithm was introduced in [10]; it
provides a data-driven calibration of the leading constant in a penalty. It
is based on the slope heuristic, which we adapt here to the particular case
of context tree estimation. Let us recall that the selected tree
$\hat\tau$ is obtained as a minimizer of the penalized criterion (9). The
heuristic describes the typical behavior of the selected tree $\hat\tau$
when $\mathrm{pen}(\tau) = C\,\mathrm{pen}_{sh}(\tau)$, where
$\mathrm{pen}_{sh}(\tau)$ is a well chosen complexity measure of $\tau$
(typically the BIC shape $|A| N(\tau) \frac{\ln n}{n}$ or the variance term
$K_{\hat\mu_n}(\hat P_\tau, P_\tau)$) and $C$ is an increasing leading
constant. It states more precisely that there exists a constant $C_{\min}$
such that

SH1 When $C < C_{\min}$, the complexity of the selected model
    $\mathrm{pen}_{sh}(\hat\tau)$ is very large, typically of the order of
    $\max_\tau \mathrm{pen}_{sh}(\tau)$.
SH2 When $C > C_{\min}$, the complexity $\mathrm{pen}_{sh}(\hat\tau)$
    becomes abruptly much smaller.
SH3 When $C = 2C_{\min}$, the selected estimator satisfies an oracle
    inequality (1) with a leading constant close to 1.

Let us now assume that we want to calibrate the leading constant $L$ in
a penalty of the form $\mathrm{pen}(\tau) = L\,\mathrm{pen}_{dd}(\tau)$,
where $\mathrm{pen}_{dd}(\tau)$ is a data-driven shape for the penalty
(typically here, we will use the BIC shape $|A| N(\tau) \frac{\ln n}{n}$).
The slope algorithm evaluates this leading constant in the following
data-driven way.

SA1 For all $L > 0$, compute the complexity $\mathrm{pen}_{dd}(\hat\tau)$
    of the model selected by the penalty
    $\mathrm{pen}(\tau) = L\,\mathrm{pen}_{dd}(\tau)$.
SA2 Choose $L_{\min}$ such that this complexity is very large for
    $L < L_{\min}$ and much smaller for $L > L_{\min}$.
SA3 Choose finally the constant $L = 2L_{\min}$.
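
Steps SA1 to SA3 can be sketched as follows in Python. This is a minimal sketch under explicit assumptions: each candidate model is summarized by a pair (empirical loss, penalty shape), the constants are scanned over a finite increasing grid, and $L_{\min}$ is located at the largest drop in selected complexity between consecutive grid points. The "largest drop" rule is only one possible reading of SA2, and the function name and toy numbers are hypothetical.

```python
def slope_constant(models, grid):
    """Return (L_min, 2 * L_min) from a scan of leading constants.

    models: list of (empirical_loss, shape) pairs, one per candidate model.
    grid:   increasing list of candidate leading constants L.
    """
    # SA1: complexity (shape) of the model selected for each constant L.
    selected = []
    for L in grid:
        loss, shape = min(models, key=lambda m: m[0] + L * m[1])
        selected.append(shape)
    # SA2: place L_min where the selected complexity drops the most.
    drops = [selected[i] - selected[i + 1] for i in range(len(grid) - 1)]
    i_star = max(range(len(drops)), key=drops.__getitem__)
    l_min = grid[i_star + 1]
    # SA3: the calibrated constant is twice the minimal one.
    return l_min, 2 * l_min


# Toy models (assumed numbers): larger shape means smaller empirical loss.
toy_models = [(1.0, 0.1), (0.5, 0.5), (0.1, 2.0)]
toy_grid = [0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 1.5, 2.0]
l_min, l_final = slope_constant(toy_models, toy_grid)
```

On a real problem the grid resolution limits how precisely the jump is located, so a finer grid around the detected jump is a natural refinement.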

The algorithm is efficient if, for some constant $L_0$ and some shape
penalty $\mathrm{pen}_{sh}$ satisfying the slope heuristic, we have

    $(L_0 - o(1))\,\mathrm{pen}_{dd}(\tau) \le \mathrm{pen}_{sh}(\tau) \le (L_0 + o(1))\,\mathrm{pen}_{dd}(\tau)$ .

Actually, by SA2, we observe a jump in the complexity of the selected
model when $L \simeq L_{\min}$, hence, by SH1, SH2,

    $(L_{\min} - o(1))\,\mathrm{pen}_{dd}(\tau) \le C_{\min}\,\mathrm{pen}_{sh}(\tau) \le (L_{\min} + o(1))\,\mathrm{pen}_{dd}(\tau)$ .

Therefore, the model selected by SA3 with the penalty
$2L_{\min}\,\mathrm{pen}_{dd}(\tau) \simeq 2C_{\min}\,\mathrm{pen}_{sh}(\tau)$
satisfies an oracle inequality, thanks to SH3.

The words "very large" and "much smaller" in Step SA2, borrowed from
[3, 10], are not very precise. We refer to [3] Section 3.3 for a detailed
discussion on what they mean in this context and for precise suggestions on
the implementation of the slope algorithm. We refer also to [5] and to
Section 5 for practical implementations of the slope algorithm in
M-estimation and in our framework, respectively.

This section presents some theoretical evidence for the slope heuristic.
We show the jump of the complexity of the selected model around a minimal
penalty as predicted by SH1, SH2 and we prove that a penalty equal to
2 times the minimal one has oracle properties. We do not prove that the
leading constant is asymptotically equal to one as predicted by SH3. We
present finally a theorem that emphasizes what remains to be done to obtain
a complete proof of SH3. The complexity is
$\Delta(\tau) = K_{\hat\mu_n}(\hat P_\tau, P_\tau)$. We prove in Lemma 19 an
upper bound of this term for all $\tau$ in
$\hat{\mathcal{F}}^{(n)}(\delta)$.

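Theorem 7. Let $(X_n)_{n \in \mathbb{Z}}$ be a stationary ergodic process
satisfying the concentration condition (CC). Let $\delta > 0$ and let
$\hat{\mathcal{F}}^{(n)}(\delta)$ and $\mathcal{F}_*(\delta)$ be the sets
defined in (6) and (7). Let $\Omega_{good}$ be the event (5). Let $r > 0$
and let $\hat\tau$ be the penalized estimator defined in (9), with

(11)    $\forall \tau \in \hat{\mathcal{F}}^{(n)}(\delta)$, $0 \le \mathrm{pen}(\tau) \le (1 - r) K_{\hat\mu_n}(\hat P_\tau, P_\tau)$ .

Let
$p_{\min} = \inf_{(\omega, a) \in \mathcal{F}_*(\delta) \times A, P(a|\omega) \ne 0} P(a|\omega)$.
Let $\tau_*$ be the maximizer over $\hat{\mathcal{F}}^{(n)}(\delta)$ of
$K_{\hat\mu_n}(\hat P_\tau, P_\tau)$ and let $\tau_o$ be a minimizer of
$K_\mu(P, \hat P_\tau)$ over $\hat{\mathcal{F}}^{(n)}(\delta)$. Assume that
there exist $\varphi_{MP} \to 0$ and an event $\Omega_{MP}$ satisfying
$\mathbb{P}(\Omega_{MP}) \ge 1 - \varphi_{MP}$, such that, on $\Omega_{MP}$,

    $K_\mu(P, \bar P_{\tau_*}) = o\left( K_{\hat\mu_n}(\hat P_{\tau_*}, P_{\tau_*}) \right)$ , $K_\mu(P, \hat P_{\tau_o}) = o\left( K_{\hat\mu_n}(\hat P_{\tau_*}, P_{\tau_*}) \right)$ .

There exists $L := L(p_{\min}, r)$ such that, on
$\Omega_{good} \cap \Omega_{MP}$, we have

(12)    $K_{\hat\mu_n}(\hat P_{\hat\tau}, P_{\hat\tau}) \ge L\, K_{\hat\mu_n}(\hat P_{\tau_*}, P_{\tau_*})$ .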

Remark 4. It is convenient to assume that there exists a constant
$p_0 > 0$ such that $p_{\min} \ge p_0$. In that case it comes from the proof
of Theorem 7 that $L(p_{\min}, r) \ge L' r$ for some $L'$ depending only on
$p_0$. Theorem 7 states that a penalty smaller than
$K_{\hat\mu_n}(\hat P_\tau, P_\tau)$ selects a model with maximal value of
$K_{\hat\mu_n}(\hat P_{\hat\tau}, P_{\hat\tau})$. This is exactly SH1 for
the complexity measure
$\Delta(\tau) = K_{\hat\mu_n}(\hat P_\tau, P_\tau)$.

Remark 5. The extra assumption
$K_\mu(P, \bar P_{\tau_*}) = o\left( K_{\hat\mu_n}(\hat P_{\tau_*}, P_{\tau_*}) \right)$
is natural, since the model with maximal complexity is likely to have a lot
of leaves, and therefore a small bias.

Remark 6. The assumption
$K_\mu(P, \hat P_{\tau_o}) = o\left( K_{\hat\mu_n}(\hat P_{\tau_*}, P_{\tau_*}) \right)$
means that the risk of an oracle is much smaller than the maximal risk. It
is a natural assumption for the slope heuristic to hold; actually, in SH2,
SH3, the interesting models are those with a complexity much smaller than
the biggest one.

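Theorem 8. Let $(X_n)_{n \in \mathbb{Z}}$ be a stationary ergodic process
satisfying the concentration condition (CC). Let $\delta > 0$ and let
$\hat{\mathcal{F}}^{(n)}(\delta)$ be the set defined in (6). Let
$\Omega_{good}$ be the event (5). Let $r_1 > 0$, $r_2 > 0$ and let
$\hat\tau$ be the penalized estimator defined in (9), with

(13)    $\forall \tau \in \hat{\mathcal{F}}^{(n)}(\delta)$, $(1 + r_1) K_{\hat\mu_n}(\hat P_\tau, P_\tau) \le \mathrm{pen}(\tau) \le (1 + r_2) K_{\hat\mu_n}(\hat P_\tau, P_\tau)$ .

Let $\tau_*$ be the maximizer of $K_{\hat\mu_n}(\hat P_\tau, P_\tau)$ over
$\hat{\mathcal{F}}^{(n)}(\delta)$ and let $\tau_o$ be a minimizer of
$K_\mu(P, \hat P_\tau)$ over $\hat{\mathcal{F}}^{(n)}(\delta)$. Assume that
there exist $\varphi_{MP} \to 0$ and an event $\Omega_{MP}$ satisfying
$\mathbb{P}(\Omega_{MP}) \ge 1 - \varphi_{MP}$, such that, on $\Omega_{MP}$,

    $K_\mu(P, \bar P_{\tau_*}) = o\left( K_{\hat\mu_n}(\hat P_{\tau_*}, P_{\tau_*}) \right)$ , $K_\mu(P, \hat P_{\tau_o}) = o\left( K_{\hat\mu_n}(\hat P_{\tau_*}, P_{\tau_*}) \right)$ ,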



Vr E J-i")(5), dK^{P,Pr)K^^{Pr,Pr) = O ( i^^,^ (P.„ P.J 



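On $\Omega_{good} \cap \Omega_{MP}$, we have

(14)    $K_{\hat\mu_n}(\hat P_{\hat\tau}, P_{\hat\tau}) = o\left( K_{\hat\mu_n}(\hat P_{\tau_*}, P_{\tau_*}) \right)$ .

In addition, a sequence
$\eta = O\left( \frac{1}{\sqrt{\Lambda_n}} \vee \frac{1}{\Lambda'_n} \right)$
exists such that, $\forall \epsilon_1 > 0$, $\forall \epsilon_2 > 0$, there
exists $C_* := C_*(r_1, r_2, \epsilon_1, \epsilon_2, p_{\min})$ such that,
on $\Omega_{good} \cap \Omega_{MP}$,
(15)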



(1 - ei - 62) A ri - ei 



1 

€2 

r(") 



Kf,iP,Pr) < C^K^{P,Pr, 



Remark 7. The assumption Vr G ^^ '{6), K^{P,Pr)K^^{Pr,Pr) = 

a ^ LC^T^ [Pt^ ) -fr* ) ^ means that there is no model with a lot of bias and a big 
variance. It typically holds when trees with large variance are those with a 
lot of leaves whereas trees with a large bias are the small ones. 

Remark 8. SH2 immediately follows from (14) since, as soon as the
penalty becomes larger than $K_{\hat\mu_n}(\hat P_\tau, P_\tau)$, the
complexity of $\hat\tau$ becomes much smaller than the largest one. SH3
follows partially from (15). The oracle
property implies in particular the convergence of the quantities Pf(a|(^) to 
Pr^{a\Lo) SO that L^°''^^ — t- 1. Therefore, the condition ri > L^°''^^ to obtain 
the oracle inequality becomes asymptotically $r_1 > 1$, and the condition
(13) on the penalty becomes
$\mathrm{pen}(\tau) > 2\,\mathrm{pen}_{\min}(\tau)$. (15) states then that
$2\,\mathrm{pen}_{\min}$ is asymptotically a penalty yielding an oracle
inequality, but this is not exactly SH3 which states, moreover, that
$C_* \to 1$.

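As mentioned, we did not completely prove point SH3 of the heuristic.
The following theorem emphasizes the missing point of the proof. In order
to state the result, let us define, for all
$\tau \in \hat{\mathcal{F}}^{(n)}(\delta)$,

    $L(P_\tau) := \sum_{(\omega, a) \in \tau \times A} \left( \hat\mu_n(\omega a) - \mu(\omega a) \right) \ln \frac{1}{P_\tau(a|\omega)}$ .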

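Theorem 9. Let $(X_n)_{n \in \mathbb{Z}}$ be a stationary ergodic process
satisfying the concentration condition (CC). Let $\delta > 0$ and let
$\hat{\mathcal{F}}^{(n)}(\delta)$ be the set defined in (6). Let
$\Omega_{good}$ be the event (5). Let $r_1 > 0$, $r_2 > 0$ and let
$\hat\tau$ be the penalized estimator defined in (9), with a penalty term
satisfying (13). Let $\tau_o$ be a minimizer of $K_\mu(P, \hat P_\tau)$ over
$\hat{\mathcal{F}}^{(n)}(\delta)$. Let $u < 1$, $v < 1$ and

    $\Omega_{mis} := \left\{ \forall \tau \in \hat{\mathcal{F}}^{(n)}(\delta),\ L(P_\tau) - L(P_{\tau_o}) \le u K_\mu(P, \hat P_\tau) + v K_\mu(P, \hat P_{\tau_o}) \right\}$ .

On $\Omega_{good} \cap \Omega_{mis}$, there exists
$\eta = O\left( \frac{1}{\sqrt{\Lambda_n}} \vee \frac{1}{\Lambda'_n} \right)$
such that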

(16) 

{l-u)K^{P,Pr) + {l + ri-u-7])K^.{P^,P^) < {l + r2+v + r])K^{P,Pr) . 

Remark 9. Assume that there exist $u \to 0$, $v \to 0$, $\psi \to 0$ such
that $\mathbb{P}\{\Omega_{mis}\} \ge 1 - \psi$. Then, from (16), for any
$r_1 > 0$, the complexity of the selected model is the one of an oracle,
that should be much smaller than the maximal one as already explained. This
is SH2 with
$\mathrm{pen}_{\min}(\tau) = K_{\hat\mu_n}(\hat P_\tau, P_\tau)$. Moreover,
for $r_1 = r_2 = 1$, i.e.
$\mathrm{pen}(\tau) = 2 K_{\hat\mu_n}(\hat P_\tau, P_\tau) = 2\,\mathrm{pen}_{\min}(\tau)$,
(16) shows that the risk of the selected model is asymptotically exactly
the one of an oracle. This is SH3.

Remark 10. The weakness of Theorem 8 comes from the fact that we
were not able with our approach to prove that $\Omega_{mis}$ holds with
large probability with $u, v \to 0$. We only obtain this result for some
$u > 0$, $v > 0$.



PT{a\ijj) 
Remark 11. In the mixing case that we develop in Section 4, we can
show that $\Omega_{mis}$ holds with large probability with $u, v \to 0$ if
there exists a fixed set $\mathcal{F}$ containing
$\hat{\mathcal{F}}^{(n)}(\delta)$ with large probability such that
$\mathrm{Card}\{\mathcal{F}\} = O(n^{\alpha})$, see Proposition 12 in
Section B.2.

Remark 12. Theorem 9 shows a difference between context tree estimation
and other classical problems of regression or density estimation, where
the slope heuristic has been proved. In these frameworks, it is easy to
prove that $\Omega_{mis}$ holds with large probability as a consequence of
Bennett's concentration inequality, see [3, 32]. The main difficulty for
proving the slope heuristic is then to show that, with our notation,
$K_{\hat\mu_n}(\hat P_\tau, P_\tau) \simeq K_\mu(P_\tau, \hat P_\tau)$, see
for example [3]. In context tree estimation, this last result is a direct
consequence of typicality, as shown by Lemma 19, and the problem of
$\Omega_{mis}$ seems harder.

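4. Application in the mixing case. We showed in the previous section
that oracle inequalities and the slope heuristic can be derived from the
concentration condition (CC). Our aim in this section is to show that such
a concentration result holds for mixing processes.

Let us recall the definition of $\beta$-mixing and $\phi$-mixing
coefficients, due respectively to [43] and [29]. Let
$(\Omega, \mathcal{X}, \mathbb{P})$ be a probability space and let
$\mathcal{A}$ and $\mathcal{B}$ be two $\sigma$-algebras included in
$\mathcal{X}$. We define

    $\beta(\mathcal{A}, \mathcal{B}) = \frac{1}{2} \sup \sum_{i=1}^{I} \sum_{j=1}^{J} \left| \mathbb{P}\{A_i \cap B_j\} - \mathbb{P}\{A_i\}\mathbb{P}\{B_j\} \right|$ ,

    $\phi(\mathcal{A}, \mathcal{B}) = \sup_{A \in \mathcal{A}, \mathbb{P}\{A\} > 0} \sup_{B \in \mathcal{B}} \left\{ \mathbb{P}\{B|A\} - \mathbb{P}\{B\} \right\}$ .

The first sup is taken among all the finite partitions
$(A_i)_{i=1,\ldots,I}$ and $(B_j)_{j=1,\ldots,J}$ of $\Omega$ such that, for
all $i = 1, \ldots, I$, $A_i \in \mathcal{A}$ and for all
$j = 1, \ldots, J$, $B_j \in \mathcal{B}$.

For all stationary sequences of variables $(X_n)_{n \in \mathbb{Z}}$
defined on $(\Omega, \mathcal{X}, \mathbb{P})$, let

    $\beta_k = \beta(\sigma(X_i, i \le 0), \sigma(X_i, i \ge k))$ , $\phi_k = \phi(\sigma(X_i, i \le 0), \sigma(X_i, i \ge k))$ .

The process $(X_n)_{n \in \mathbb{Z}}$ is said to be $\beta$-mixing when
$\beta_k \to 0$ as $k \to \infty$, it is said to be $\phi$-mixing when
$\phi_k \to 0$ as $k \to \infty$. It is easy to check, see for example
inequality (1.11) in [12], that
$\beta(\mathcal{A}, \mathcal{B}) \le \phi(\mathcal{A}, \mathcal{B})$ so that
a $\phi$-mixing process $(X_n)_{n \in \mathbb{Z}}$ is also $\beta$-mixing.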







Theorem 10. Let $(X_n)_{n \in \mathbb{Z}}$ be a $\phi$-mixing process satisfying

oo 

(MC) $ := < oo . 

A;=0 

(ND)    $\exists \lambda < 1$ : $\forall (\omega, a) \in A^{-\mathbb{N}} \times A$, $P(a|\omega) \le \lambda$ .

Let dn < n — 1, Qn < {n — l)/4, t G {n — l,n} and assume that Vn '■= 

dn + In ^ i- Let L^ = 4 ^<I> + ■ There exists an event J^conp satisfying 

fi { recoup } < 2n2/?g„ such that, for all y > and all lo G 

(17) 

I y n — + 1 n — |a;| + 1 I 

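Remark 13. An important example of application is the case of
geometrically mixing processes, i.e., when the following assumption holds.

(GMC) $\exists (L_{\mathrm{mix}}, \gamma_{\mathrm{mix}}) \in (\mathbb{R}_+^*)^2 : \forall k \in \mathbb{N},\ \beta_k \le L_{\mathrm{mix}} e^{-\gamma_{\mathrm{mix}} k}$ .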

Then we can choose dn = logn, qn = O(logn) in Theorem 10 and we obtain 
that geometrically mixing processes satisfy (CC) with pn = (n — logn)~^L^, 
Qn = 0{n~^ logn), Lfn = and ^Iconc = ^coup- 

An immediate consequence of the previous remark is that the following
corollary of Theorem 6 holds.

Corollary 11. Let {Xn)n& be a (p-mixing process satisfying (MC), 
(ND) and (GMC). Let dn = log(n), let tt be the uniform probability mea- 
sure on Td„ and let 

jc-]") = G J", : Va G ^, (n - |a;| + l)/2„(a;a) G {0} U [(Inn)^, +oo [ } . 

Let L > 18 + 81p^Jj^ and let r be the penalized estimator defined in (9), with 

(18) Vr G J^i''\ pen(r) > LLI\A\N{t)— . 

n 

There exist Uq and a constant such that, for all n > Uq, we have 

2 



{ Vr G C^K^iP, P^) < K^{P, Pr) + pen(r) } > 1 



Remark 14. This corollary shows that BIC estimators have oracle
properties, provided that the constant c is sufficiently large. The
drawback of this result is that the constant is unknown in practice. We
recommend using the slope algorithm to overcome this problem. We will
present some simulations to emphasize the advantages of this approach.
Fig 1. Bias (dark gray) and variance (light gray) part of the risk (black). Observe
that the oracle has a size of order 5, much smaller than the exact tree. 

5. Simulation Study. In this section, we illustrate our theoretical
results by simulation experiments in the family of renewal processes. A
renewal process is defined here as a binary valued process ($A = \{0,1\}$)
for which the distances between successive occurrences of symbol 1 are
independent, identically distributed variables. In our simulations, the
renewal distribution was Poisson with parameter 3. The models we
considered were all renewal processes with renewal times bounded by
$K_0 = 14$. Their context trees are the subtrees of
$\tau = \{10^k, k = 0, \dots, K_0\} \cup \{0^{K_0+1}\}$. In this
experiment, we used a sample size of n = 500.
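
A minimal sketch of how such a sample could be generated is given below. It rests on assumptions that the text does not spell out: the number of 0s between two consecutive 1s is drawn as a Poisson(3) variable capped at $K_0 = 14$, the sequence is started at an arbitrary point rather than in the stationary regime, and the names `sample_poisson` and `simulate_renewal` are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variable with Knuth's product method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_renewal(n, lam=3.0, k_max=14, seed=0):
    """Binary sequence of length n with i.i.d. gaps between successive 1s.

    Assumption: each gap (number of 0s before the next 1) is Poisson(lam),
    capped at k_max so that renewal times stay bounded.
    """
    rng = random.Random(seed)
    seq = []
    while len(seq) < n:
        gap = min(sample_poisson(lam, rng), k_max)
        seq.extend([0] * gap)
        seq.append(1)
    return seq[:n]


sample = simulate_renewal(500, seed=1)
print(sum(sample), "ones among", len(sample), "symbols")
```

With this convention every run of 0s has length at most 14, so each position has a context of the form $10^k$ with $k \le K_0$ once a first 1 has been observed.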

5.1. Bias and variance of the risk. Figure 1 shows the bias and
variance terms of the risk of the trees
$\tau_{k_0} = \{10^k, k = 0, \dots, k_0\} \cup \{0^{k_0+1}\}$ in the
previous model as a function of $k_0$. The bias can be computed easily;
the variance part is estimated by a Monte-Carlo method over N = 10000
experiments.
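
To make the estimators attached to the trees $\tau_{k_0}$ concrete, the sketch below counts, for each context, how often it is followed by 0 and by 1, and turns the counts into empirical conditional probabilities. This is an illustration under assumptions: the function names are hypothetical, positions whose context is not yet determined (before the first 1) are skipped, and the plain count ratio is used instead of the exact normalization of the empirical measure in the paper.

```python
def context_counts(seq, k0):
    """Count the symbols following each context of the tree tau_{k0}.

    Index k <= k0 stands for the context 1 0^k (k zeros since the last 1);
    index k0 + 1 stands for 0^{k0+1} (at least k0 + 1 trailing zeros).
    """
    counts = [[0, 0] for _ in range(k0 + 2)]
    run, seen_one = 0, False          # trailing zeros, and whether a 1 occurred
    for symbol in seq:
        if run >= k0 + 1:
            counts[k0 + 1][symbol] += 1
        elif seen_one:
            counts[run][symbol] += 1
        # Update the context for the next position.
        if symbol == 1:
            run, seen_one = 0, True
        else:
            run += 1
    return counts


def conditional_probabilities(counts):
    """Empirical P(a | context) as (p0, p1); None for unseen contexts."""
    probs = []
    for c0, c1 in counts:
        total = c0 + c1
        probs.append((c0 / total, c1 / total) if total > 0 else None)
    return probs


toy = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
print(conditional_probabilities(context_counts(toy, k0=1)))
```

Repeating this over many independent samples and averaging the resulting losses is one way to set up the Monte-Carlo estimation of a variance term.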

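5.2. The slope phenomenon. We illustrate the slope phenomenon. The
measure of complexity is $K_{\hat\mu_n}(\hat P_\tau, \hat P_\tau^*)$, where
$\hat P_\tau^*$ is a bootstrap estimator of $\hat P_\tau$, and we plot the
complexity of the tree selected by minimization of the criterion

$$\sum_{(\omega,a)\in\tau\times A} \hat\mu_n(\omega a) \ln\left(\frac{1}{\hat P_\tau(a|\omega)}\right) + c\, K_{\hat\mu_n}(\hat P_\tau, \hat P_\tau^*)$$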
Fig 2. Example of slope heuristic. Observe the sudden change in behavior around
the minimal penalty. 



for the positive constants c. The results are shown in Figure 2. We
clearly see that when c is smaller than 1 the complexity is the largest
possible, which is the content of Theorem 7. We also observe that when c
is slightly larger than 1 there is a sudden decrease in the complexity,
which is the content of Theorem 8. For small values of c, very large
models are chosen, and the bootstrap estimation of their complexity is not
reliable: this explains the absence of monotonicity in the left-most part
of the graph, as well as in the right-most part of Figure 3.

5.3. Slope algorithm. In this section, we illustrate the performance of
the slope algorithm. We take n = 500 and N = 10000. We use the following
penalization procedures.

Method 1 : BIC. The penalty is equal to the BIC penalty with c = 1/2. 

Method 2 : BIC+Slope. The penalty term is equal to the BIC penalty
and the constant c is computed with the slope algorithm SA1-SA2-SA3 of
section 3.4. In step SA2, we choose for $L_{\min}$ the constant minimizing
a discrete derivative of the function $L \mapsto \mathrm{pen}(\hat\tau(L))$,
where $\hat\tau(L)$ is the tree selected by the penalty
$L|A|N(\tau)(\ln n)/n$.

Method 3 : Resampling. For all words $\omega$, the conditional
probabilities are estimated by a bootstrap method and, following Efron's
heuristic (see [21]), $K_\mu(P_\tau, \hat P_\tau)$ is then estimated by the
quantity $K_{\hat\mu_n}(\hat P_\tau, \hat P_\tau^*)$, where $\hat P_\tau^*$
denotes the bootstrap estimator of $\hat P_\tau$. Following Theorem 8, the
penalty is taken equal to $2K_{\hat\mu_n}(\hat P_\tau, \hat P_\tau^*)$.

Fig 3. Variance term (black), bootstrap estimator (gray) and BIC penalty with
c=1/2 (dotted).

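Method 4 : Resampling+Slope. The penalty term is
$L K_{\hat\mu_n}(\hat P_\tau, \hat P_\tau^*)$, where $\hat P_\tau^*$ denotes
the bootstrap estimator of $\hat P_\tau$ and the constant L is evaluated by
the slope algorithm, with the complexity
$K_{\hat\mu_n}(\hat P_\tau, \hat P_\tau^*)$. Step SA2 of the slope algorithm
is evaluated in the same way as in Method 2.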

The motivation to use resampling methods comes from the fact that the 
variance term is better estimated than with the BIC penalty, as shown by 
Figure 3. 
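
The calibration used in Methods 2 and 4 can be sketched as follows. This is a generic version of the slope heuristic written under assumptions, not a transcription of steps SA1-SA3 of section 3.4: models are indexed by integers, `contrast[m]` and `complexity[m]` are hypothetical inputs holding the empirical contrast and the complexity of model m, the minimal constant is located at the steepest drop of the selected complexity along a grid of constants, and the final penalty uses twice that constant.

```python
def slope_select(contrast, complexity, grid):
    """Pick a model with a penalty constant calibrated by the largest jump.

    contrast[m]   : empirical contrast of model m (smaller is better)
    complexity[m] : complexity measure multiplied by the constant L
    grid          : increasing candidate values of L
    Returns (selected model index, estimated minimal constant).
    """
    def select(L):
        return min(range(len(contrast)),
                   key=lambda m: contrast[m] + L * complexity[m])

    # Complexity of the model selected for each constant of the grid.
    selected = [complexity[select(L)] for L in grid]
    # Discrete derivative of L -> selected complexity.
    slopes = [
        (selected[i + 1] - selected[i]) / (grid[i + 1] - grid[i])
        for i in range(len(grid) - 1)
    ]
    # The minimal constant sits where the complexity drops most sharply.
    i_min = min(range(len(slopes)), key=slopes.__getitem__)
    L_min = grid[i_min + 1]
    # Final selection with twice the minimal constant.
    return select(2 * L_min), L_min


# Toy numbers, for illustration only.
contrast = [10.0, 6.0, 5.0, 4.0, 3.0]
complexity = [1, 2, 3, 4, 5]
grid = [0.5, 0.9, 1.1, 1.5, 2.0, 3.0]
print(slope_select(contrast, complexity, grid))
```

The only extra work compared with a fixed-constant penalty is the sweep over the grid, which reuses the same contrasts and complexities for every constant.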

Figure 4 presents histograms of the models selected by methods 1-4, for 
n = 1000, N = 10000. 

Finally, in order to illustrate the oracle properties of the selected
estimators, we report in Table 1 the values of the ratio

$$\frac{K_\mu(P, \hat P_{\hat\tau})}{\inf_{\tau} K_\mu(P, \hat P_\tau)}$$

for the different methods, the infimum being taken over the candidate
trees. We give the mean value of the risk ratio over N = 10000
experiments, with the standard deviation in parentheses.

It is interesting to remark that the oracle performances of the BIC
estimator are improved by the slope algorithm. On the other hand,
resampling estimators do not seem to improve the results significantly.
Moreover, the slope algorithm combined with this penalization method gives
the worst results here. As the computational cost of resampling methods is
quite heavy, we do not recommend using them in practice. On the other
hand, the slope algorithm does not add a significant computational cost
and can be used to choose the leading constant.

Fig 4. Histograms of the models selected by methods 1-4, respectively filled in black,
slanted hatch, horizontal hatch and gray; models selected by AIC are depicted in
light gray. Observe that the models selected by BIC penalties are smaller than those
obtained with resampling penalties, both with and without the slope algorithm. As
expected, for such a small sample the selected estimator is always significantly smaller
than the actual context tree of the source.

Table 1
Comparative table of oracle ratios

Method        BIC               BIC+Slope         Resampling        Res+Slope
risk ratio    1.5245 (0.8568)   1.2665 (0.5657)   1.2751 (0.3230)   1.6707 (1.9702)

Note that, in general, Methods 3 and 4 involve a minimization problem
(9) that is not computationally tractable. Here, the particular structure
of renewal processes allowed us to consider all the models; in more general
settings, we propose to proceed in two steps: first, select a set of trees
defined by the image of $c \mapsto \hat\tau_{\mathrm{BIC}}(c)$, $c > 0$,
where, for all $c > 0$, $\hat\tau_{\mathrm{BIC}}(c)$ is the tree selected by
the BIC penalty with constant c. Then, select among those trees with the
proposed resampling methods.
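
The two-step procedure can be sketched as follows, under simplifying assumptions: candidate trees are indexed by integers, `contrast[m]` and `n_leaves[m]` are hypothetical inputs giving the empirical contrast and the number of contexts $N(\tau)$ of tree m, the constant c ranges over a finite grid rather than over all positive reals, and `second_criterion` stands for whatever resampling-based criterion is used in the second step.

```python
import math


def bic_candidates(contrast, n_leaves, n, alphabet_size, grid):
    """Distinct trees selected by the BIC-like penalty as c runs over grid.

    The penalty is c * |A| * N(tau) * ln(n) / n, added to the contrast.
    """
    chosen = []
    for c in grid:
        scores = [
            value + c * alphabet_size * leaves * math.log(n) / n
            for value, leaves in zip(contrast, n_leaves)
        ]
        best = min(range(len(scores)), key=scores.__getitem__)
        if best not in chosen:
            chosen.append(best)
    return chosen


def two_step_select(contrast, n_leaves, n, alphabet_size, grid, second_criterion):
    """Step 1: BIC path over the grid. Step 2: minimize the second criterion."""
    candidates = bic_candidates(contrast, n_leaves, n, alphabet_size, grid)
    return min(candidates, key=second_criterion)


# Toy numbers, for illustration only.
contrast = [0.9, 0.5, 0.45, 0.44]
n_leaves = [1, 2, 3, 4]
resampling_score = [0.3, 0.1, 0.2, 0.4]   # stand-in for a resampling criterion
print(bic_candidates(contrast, n_leaves, 500, 2, [0.1, 1.0, 5.0, 50.0]))
print(two_step_select(contrast, n_leaves, 500, 2, [0.1, 1.0, 5.0, 50.0],
                      lambda m: resampling_score[m]))
```

The expensive criterion is then evaluated only on the few trees that appear along the BIC path, instead of on the whole collection.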
6. Conclusion. We developed an oracle approach for context tree
selection with Kullback loss. Our presentation emphasizes the central role
of concentration inequalities for sequences of words. We proved that such
concentration inequalities hold for geometrically $\phi$-mixing sequences.
As a corollary of our general approach, we obtained some oracle properties
of the BIC-like estimators in this framework.

We also provided both numerical and theoretical justification for the use of
the slope heuristic in this problem in order to calibrate the leading constant
in the penalties. This provides in particular an answer for the practical
choice of the leading constant in the BIC-like penalty. Actually, [18] proved
the consistency of the BIC estimators for any value of c. On the other hand,
[23] proved that, for any finite value of n, the set of trees selected by the
BIC-like penalties $(\mathrm{pen}_c(\tau))_{c>0}$ is the set of champions,
where, for any $k < \log n$, the champion of size k is the one maximizing the
log-likelihood among those trees $\tau$ such that $N(\tau) < k$.

There is a growing interest in the slope heuristic, see for example [10, 3,
32, 31, 38, 33]. However, the theoretical analysis of this method is still in
its infancy and our results are a significant contribution. In particular, we
provide, to our knowledge, the first proof of the relevance of the slope
heuristic in a discrete non-i.i.d. framework.

Our results also emphasize the interest of the oracle approach, compared
to the identification approach. In fact, a large part of the interest of
context tree models lies in the fact that any stationary ergodic source can
be approximated, in the Kullback information distance, by context tree sources:
hence, the use of these models is not restricted to cases when the true source 
belongs to one of them. Besides, even if it is finite, the true source's con- 
text tree is likely not to be the best model to use for small samples, as 
illustrated in our simulation study. In fact, we showed that the BIC estima- 
tor presents nice oracle properties, and that it can be further improved by 
choosing the leading constant in the penalty adaptively. This result justifies 
the use of context tree models in practical applications much more than the 
consistency properties that are usually mentioned. 

An important question related to the oracle approach is to obtain upper 
bounds for the risk of the selected estimator. We showed in Section C that 
such bounds can be obtained with continuity rates. Actually, these conti- 
nuity rates provide upper bounds for the bias term and yield good mixing 
properties, so that we can use the upper bounds of the variance term ob- 
tained in the mixing case. 

The $\phi$-mixing properties assumed in Section 4 are somewhat restrictive.
It would be interesting to work with weaker assumptions, for example, with 
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weaker mixing coefficients, as (p (see [19] for a definition). These mixing 
coefficients are sufficient to generalize some results of model selection (see 
[30, 31] for example). Another interesting problem would be to look for 
natural mixing properties of context tree sources. As mentioned, we proved 
such properties in Section C using a theorem of [16]. This last theorem 
was obtained as a consequence of the existence of a constructive perfect 
simulation scheme for the chains. New perfect simulation schemes have been 
developed recently [26], using less restrictive assumptions on the chains. It 
would be interesting to see what mixing-properties can be deduced from 
these new constructions. 

Acknowledgements. We would like to thank gratefully Roberto I. Oliveira 
Galves for many discussions and helpful advice during the writing of the
paper.
APPENDIX A: PROOFS IN THE GENERAL CASE

A.l. Proof of Proposition 5. Let {u},a) G x A. By definition, 

if ^{ujo) 7^ 0, 
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^(1) ^ ^(2) 



1 1 



) 



/i(a;a) . 
Moreover,



ti{u,) > ii(ua) >\nl -ij^ j ( Ai,'Vn V Afg. ) 



> 

Hence, 



Thus, on figood) all the words in belong to Ttyp{r]), with 



^(1) ' ^(2) 



Let now (w, a) € (5) x A. On ^good^ if V'ni'^o) / 0, we have 



7r(a;)(5 



Hence, by definition of J'i"''((5), 



(19) J^l-^-^^j ?„(-)<mM. 

As a consequence, oj G Ti,{d) for all n such that 

Let finally (w, a) G ^i^^(^) x ^. On Ugood, if lJ.{^a) ^ 0, we have 



Hence, by definition of lo G J^i"\6) for all n such that 



1 1 1 

< 



A 



(1) A (2) - 2 




A. 2. Proof of Theorem 6. Let r C From Lemma 15 and the 

definition of r, we have 

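$$K_\mu(P, \hat P_{\hat\tau}) \le K_\mu(P, P_\tau) - K_{\hat\mu_n}(\hat P_\tau, P_\tau) + \mathrm{pen}(\tau)$$
(20) $$\qquad + \left( K_\mu(P_{\hat\tau}, \hat P_{\hat\tau}) + K_{\hat\mu_n}(\hat P_{\hat\tau}, P_{\hat\tau}) - \mathrm{pen}(\hat\tau) \right) + L(P_\tau) - L(P_{\hat\tau}) .$$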

It comes from Proposition 5 that, on ^Igoodi all the words in r U r belong 



to Ttyp{ri) for some rj = O I . -j^ V — ^ ) . Therefore, Lemma 17 gives, for 



any e > 0, 



L{Pr) - L{Pr) <(e + 0(r?)) {K^{P, Pr) + K^{P, Pr)) 

+ (1 + 0(r?))^— ( K^^ {Pr, Pr) + K^^iPr, Pr] 



Assume now that 77 < 1/3 and let e = 1/2. By typicality, it holds that 

M^'^-* < ^K^r,^ and hence, from (20), we deduce that (1/2 - 0{r]))K^{P, Pr) 
is upper-bounded by 

(21) 

+ 0(7?)) K^iP,Pr) + (1 + 0{r^))^K^^iPr,Pr) " Kj,{Pr,Pr) 

+ pen(r) + (1 + 0(r/)) f 1 + ^ ) K^.^{Pr, %) + i^^(Pf , Pr) - pen(?) . 



In addition, from Lemma 18, there exists rj' = O ^(A^i^^ A A^^^) ^/^j such 
that, for all 6 + 18^?^^^ < L' < L, on Ogood 



( 1 + 0(7?) ) f 1 + ^ ) K^^iPr, Pr) + Kj:{Pr, Pr) 
\ Pram J 



)2 ^ 



^" / (a;,a)e 

For n sufficiently large, we have L' + rj < L, hence 

(22) (l+0(r/)) (1 + ^) K^^iP^,P^) + K^{Pr,P9) < pen(f) 

We conclude the proof, plugging (22) in (21). 
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A. 3. Proof of Theorem 7. Thanks to Proposition 5, there exists Uq 
such that for n > Uq, on Ogoodi -7^1"^ (^) C Let r_ be any element in 

J-i,{6). T minimizes over the following criterion: 

Critr_(T) := V ^n{u}a)ln I ^ \ ] 

{ui,a)eTxA \ ^\ \ I / 

+ / (i/i(a;a)ln(P(a|a;)) -L(P^_) + pen(r) . 

Thanks to Lemma 15, we have 

Crit,_(r) = K^{P,Pr) - KpiPr, Pr) + L{Pr) - L{Pr_) + pen(r) . 

Thanks to Proposition 5, there exists rji = O (^^J~^ ^ such that 

T-^{6) C Ttyp{rii). Hence, from (47) in Lemma 19, for all r G on ^Igood 

K^{Pr,Pr) - K^^{Pr,Pr)\ < O (vi) K {Pr , Pr) • 

In addition, from Lemma 17, for all t ^ t^, on Qgood, we have 

\L{Pr)-L{Pr,)\<AiT.,T) , 

where 



A(r_,r) := {2 + v^)^J L^'^^ x 



(23) 



K^{P, Pr_ ) + K^(P, Pr) ) K^^ {Pr, Pr) + K^^_ (P,_ , P,„ 



The inequalities CritT-o(r) < Cfitroijo) and CritT-^(T) < CritT-^(rv,) can there- 
fore be rewritten 

(24) K^{P,Pr.)-{T-r^,)K^^SPro.Pr^ 

> K^{P,Pr) - {l + m)K^^{Pr,Pr) " A{To,t) , 

(25) K^{P,Pr.)-[r-^,)K,^^{Pr^,Pr^) 

> K^{P,Pr) - (1 + m)K^AP9,P9) - Ain,T) . 
Recall that, on the event ^^hyp, 

K,{P,Pr.) = o(K^^^{Pr^,Pr, 
Inequality (25) can then be satisfied only if one of the following conditions
holds.

(CI) 3L := L(r, p^in) : K^^ {P,, P9) > LK^^^ (P,, , J . 

(C2) 3L := L(r,p^i„) : K^,(P^, P^) = o (i^^^^ (P.., P^j) 

and i^^(P,P^)>Li^^,^ (P.,, P.J . 

In fact, under (C2), 



hence A{to,?) = o {K ^,{P, P^)) ■ Thus inequality (24) and K^(P,P.J = 
o(i^^,jP.,,P,j) yield 

K^{P,P^) = o{K^{P,P^)) . 

This is a contradiction. Hence, condition (CI) is fulfilled. By typicality, we 
have Pmin/2 < Pmin < 3pmin/2 for n Sufficiently large, thus L{r,p^in) > 
L'{r,pram) which concludes the proof of the Theorem. 

A. 4. Proof of Theorem 8. r minimizes over J"i"^((5) the following 
criterion 

Crit(T) := V /I„(wa)ln [ ^ \ ) 
/ A \ Pr(a / 

(bj,a)&TxA \ ' / 

+ / d^i{uja) ln(P(a|tj)) - L(P.J + pen(r) . 



Thanks to Lemma 15, we have 

Crit(T) = K^iP,Pr) - Kji{P^,Pr) + L{Pr) - L{Pr^) + pen(r) . 

Thanks to Proposition 5, there exists rji = O (^^J~^ ^ such that 

J^t(5) C Ttyp{rii). Hence, from (47) in Lemma 19, for all r G -P*(^), on i^good 

Kj,{Pr,Pr) - K^^{Pr,Pr)\ < 0(r/l)K^, (P„ Pj . 

In addition, from Lemma 17, for all r 7^ Tq, on O,good, we have 

|L(PJ-L(P,J|<A(to,t), 
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where A{t,t') is defined in (23). The inequahties Crit(r) < Crit(ro) can 
therefore be rewritten 

(26) K^iP,PrJ + (r2 + 0{m))K,^^{Pr^,Pr^) 

> K^{P,Pr) + (n - 0{ni))K^^{P^,P^) - A{to,t) . 

In ^(To,r), all the terms are, on ^hyp, o{Kf^^^{PT-^, P^^)), therefore, (14) 
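follows from (26). Moreover, using repeatedly the inequalities, valid for any
$a > 0$, $b > 0$, $\epsilon > 0$,

$$\sqrt{a + b} \le \sqrt{a} + \sqrt{b} \quad \text{and} \quad 2\sqrt{ab} \le \epsilon a + \frac{b}{\epsilon} ,$$

we obtain, for any $\epsilon_1 > 0$, $\epsilon_2 > 0$,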



2V Ll^°'^y (K^(P, P.J + K,{P, P^) ) [K^^{P^. Pp) + K^^^ iPr^,Pr^) 

< (ei + e2)K^{P, P^) + ( ei + J K^^{P^, P^) 



We plug this inequality in (26), we obtain 

/ r{ro,T) \ 

i2 + ^—+0im)jK^{P,Pr^) 

+ (r2 + ( 1 + ) + 0(r?i))i^^,„ (Pr„, PrJ 

> (1 - ei - e2)K^(P,Pf) + (ri - ei - - 0{m))K^.^ {P^, P^). 

e2 

A. 5. Proof of Theorem 9. Thanks to Proposition 5, there exists Uq 
such that for n > Uo, on O,good, J^i'^\s) C -F^(5). r minimizes over J^["'*((5) 
the following criterion 

Crit(T) := Yl /^"(^«) 1" ( ) + 

/ d/i(a;a) ln{P{a\uj)) - L{Pr^) + pen(r) . 
Thanks to Lemma 15, we have

Crit(T) = K^{P,Pr) - Kp{Pr,Pr) + L{Pr) - L{Pr^) + pen(r) . 

Thanks to Proposition 5, there exists rji = O ^ such that 

J^-k{6) C Ttyp{rii). Hence, from (47) in Lemma 19, for aU r G J'^*{6), on flgood 

K^{Pr,Pr) - K^^{Pr,Pr)\ < O (m) K ^^Pr , Pr) • 

In addition, on ^mis, we have 

LiPr) - L{Pr„) < uK^iP, Pr) + vK^{P, A-J . 

Hence, the equation Crit(r) < Crit(ro) imphes, with the conditions on the 
penalty 

(1 - u)K^{P,Pr) + (1 + ri - n - m)K^^{P9, P?) 

< {l+v)K^iP,Pr) + {l + r2 + v + rii)Kf,^{P^,P9) . 

APPENDIX B: PROOFS IN THE MIXING CASE 

B.1. Proof of Theorem 10. Let us write $t = p_n r_n + u_n$, with
$0 \le u_n < r_n$. Let us now denote, for all $k = 1, \dots, p_n + 1$, the
set $I_{p_n+1-k}$ defined as:

• $I_{p_n+1-k} = \{ 1 \vee [t - d_n - k r_n + 1], \dots, t - (k-1) r_n \}$ if $k \le p_n$;

• $I_{p_n+1-k} = \{ 1, \dots, u_n \}$ if $k = p_n + 1$ and $u_n > d_n$;

• $I_{p_n+1-k} = \emptyset$ if $k = p_n + 1$ and $u_n \le d_n$.

Let $h_1 = 1_{u_n > d_n}$, $k_n = \lfloor (p_n - 1 + h_1)/2 \rfloor$,
$\ell_n = \lfloor (p_n - 2 + h_1)/2 \rfloor$. We apply Lemma 21 to the
process $(X_i)_{i\in\mathbb{Z}}$ and to the sets $(J_k)_{k=0,\dots,k_n}$ and
$(J'_k)_{k=0,\dots,\ell_n}$ where, for all $k = 0, \dots, k_n$,
$J_k = I_{1-h_1+2k}$ and, for all $k = 0, \dots, \ell_n$,
$J'_k = I_{2-h_1+2k}$. We obtain the random variables
$(Y_i)_{i \in \cup_k J_k}$ and $(Y'_i)_{i \in \cup_k J'_k}$ such that,

1. for all $k = 0, \dots, k_n$, $(Y_i)_{i\in J_k}$ has the same distribution
as $(X_i)_{i\in J_k}$ and, for all $k = 0, \dots, \ell_n$,
$(Y'_i)_{i\in J'_k}$ has the same distribution as $(X_i)_{i\in J'_k}$;

2. for all $k = 1, \dots, k_n$, $(Y_i)_{i\in J_k}$ is independent of
$(X_i, Y_i)_{i \in \cup_{l \le k-1} J_l}$ and, for all
$k = 1, \dots, \ell_n$, $(Y'_i)_{i\in J'_k}$ is independent of
$(X_i, Y'_i)_{i \in \cup_{l \le k-1} J'_l}$;

3. for every $k = 0, \dots, k_n$,
$\mathbb{P}\{ (X_i)_{i\in J_k} \ne (Y_i)_{i\in J_k} \} \le (r_n + d_n)\beta_{q_n}$
and, for every $k = 0, \dots, \ell_n$,
$\mathbb{P}\{ (X_i)_{i\in J'_k} \ne (Y'_i)_{i\in J'_k} \} \le (r_n + d_n)\beta_{q_n}$.

Let $\Omega_{\mathrm{coup}}$ be the following set

(27) $\Omega_{\mathrm{coup}} = \{ \forall k = 0, \dots, k_n,\ (X_i)_{i\in J_k} = (Y_i)_{i\in J_k} \ \text{and}\ \forall k = 0, \dots, \ell_n,\ (X_i)_{i\in J'_k} = (Y'_i)_{i\in J'_k} \}$ .

It comes from point 3 that
$\mathbb{P}\{\Omega_{\mathrm{coup}}^c\} \le (k_n + \ell_n + 2)(r_n + d_n)\beta_{q_n} \le 2n^2\beta_{q_n}$.
Let now oj e A^'^'^+i) such that /i(w) ^ 0. For every k <t, let 



k-\u\+r 



CO 



Let also, if 3A; G {0, . . . : {z — |a;| + 1, . . . ,z} C Jj(. 



otherwise, let 



Z' 



l(y') 



^(w) 



For = 1 - /ii, . . . Xfe = Ifc n { |a;| , . . . , i }. On ficoup, we have 

t Pn Pn 

i=|a;| A;=l— /ii iGXj. fc=l— fei ieXj. 

Hence, for all a; > 0, G (0, 1), a union bound gives 







> 






> 


(28) P 1 


i=|w| 


^ X n ^coup 


> =p 1 


k=0 i^Xi_h^^2k 





+ p < 



k=Q ieT2-hi+2k 



> (1 - i/)x n 



coup 



< p < 



E E 22 

fc=o«eXi_j,^+2fc 



> z/x > + P 



E E 2.' 

fe=0jeX2-;.i+2fc 



> (1 - i^)x > . 
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By construction, (EieXi„h,+2fc ^i)fc=o,...,fe„ and (Hiax^.^^+^k ^Dfc=o,...A are 
independent and upper bounded by + r„) / ^^ fiiuj)- Let 



k=0 



From Lemma 20, we have 



o-f < 2 |^^> + 
0-^ < 2 I ^> + 



1 



^(w) ^Card{Ti_/,i+2A:} , 

k=0 

In 



EVar| E ^« 



fc=0 



A- 1 
1 

A- 1 



k=0 



Therefore, for L = 2y^^> + (A - 



m = ^ Card {Xi_/i^+2fc } and n2 = ^ Card {X2-/ii+2fc } , 

k=0 k=0 

Bennett's inequality (see Lemma 25) yields that, for all y > 0,



E E 2.' 

fc=0ieXi_;ij+2fc 



E E 

fc=0jGX2-hi+2fc 

In (28), we choose 

X = L{^/ni + y/n2)y/y + 



> L^niy + 



jdn + rn)y 



> LJrvry + 



(t^n + ?-n)y 

3(z.A(l-z.))yM^' ' 
We have ^/ni + ^Jni < 2^^ t — |a;| + 1 and 



< 26-2^, 



<2e-J' . 



ni A n2 ^ ffn - 3 + /ii ^ pn. - 1 - 2 ^ 1 



1 1 

> 



ni + n2 2p„ - 3 + /ii 2(p„ - 1) 2 Pn - 1 6 

Hence, (z^ A (1 — i^)) > (n\ A n2) / (ni + "^2) > 1/3 and we have obtained 
that, for all y > 0, 



^ o r I 11,1 ^ , ('^n + "^niV O 

> 2L\/n — + \^y H , fl \l 



j=|aj| 

This result can be rewritten as (17). 



coup 



<4e-y . 
B.2. A complement for slope heuristic in the mixing case.

Proposition 12. Let $(X_n)_{n\in\mathbb{Z}}$ be a $\phi$-mixing process satisfying (MC),
(ND) and (GMC).

jc-]") = {u; G J",: Va G ^, {n-\uj\ + l)/2„(a;a) G {0} U [(Inn)^, +oo [ } . 

(n) 

Let T < Ti, be two trees in F), . For any k > dn + 1, let us define 

Let recoup be the event defined in (27). For any x > 0, e G (0, 1), let dn 
logn, 



k=d„+l 



r. ,^ r.. .e-^L^'""*^ V (Inn 



(n - dn) 

<P{f]^,„p}+2e- 



Remark 15. ^^=dn+i '^^ essentially equal to L{Pt-^) — L{Pr). Propo- 
sition 12 and a union bound state then that, in the mixing case, the event 
^mis defined in Theorem 9 holds if J^l"^ is contained in a fixed set of trees 
with cardinality polynomial in n. 

Proof. Let us keep the notation of the proof of Theorem 10. 
If3j G {0,...,£n} :{k-dn + l,...,k} C J'^, let 

Otherwise, let 

4= E ^'-^i-r-- -;'"">, 

, , , n-\uj\ \Pria\u;r) 

For any j = 0, . . . , k„, we denote by Ij the set of values of k such that 
{k — dn + 1, ■ ■ ■ , k} C Jj and, for any j = 0, . . . , by I'j the set of k such 
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that {k — dn + I, ■ ■ ■ ,k} C Jj. For any x > and u G (0, 1), we have 



k = dn + l 



> X 



+ p< 



j=o feex' 



j=Q kelj 
> (1 - u)x n ftcoup 



coup 













<P{^^co«p}+Ip| 


EE^i 




EE^i 


>(1 




j=0 k£Xj 









The random variables {^kei- ^k)j=o,...,Kn and Cl2kei' ^fc)i=o,.-/n ^^'^ ™- 
dependent by construction. Therefore, Benett's inequahty yields, for any 
x>0, 



(29) 



EE^i 



j=o keXj 
In the previous inequality. 



0-1 > V Var ( V ^fc I and b> max V 
The typicality property implies that, for n large enough, 

fn ~l~ (in 



max 



In 



(30) 



< 



{rn + dn) ln(2n) 



Moreover, for any n > (i„ + 1, by stationarity of Xf, 



Var I ^ = ^ Var(Zfe) + ^ Cov(Zfe,ZfcO 



= Card{X; }Var(Z„) + 2 ^ (r^ + « - A; + l)Cov (Z„, Z^) . 

fe=u+l 
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We have 



(cj,a)er^x A 



{ui,a)£T^xA 

Pr^{a\uj) \\'^ fi{uja) 



Pr{a\uJr) J J {n- dnY 



Lemma 23 gives 



(31) Var(Zj < (P,^, P,) . 

In addition, using Lemma 20, we get, for 

m+(n, k) = max | 4>k-u-\ui\+l^k-u-\ui\+l>Q + A''""lfc_n-|a;|+l<0 } 



(32) (n-d„)2Cov(Z„,Zfc) 

{(a;,a),(L^',a')}e(r,xA)2 



In 



PT-{a\ijJr) 



In 



(33) 



X Gov ( Ix 



, =ui'a, J-xf I ,=aja 



<m^{u,k) -\/ fi{uja) 

\ (aJ,a)er*xA 

The Cauchy-Schwarz inequality yields



In 



PT-{a\uJr) 



PT-(a|6t;T-) 
Using Lemma 23,



In 



Pr{a\ujT-) 



(w,a)eTt X A 

Plugging this inequality in (32) gives 

As |a;| < (in; we have, under (GMC), m_|_ 



We always also have the basic inequality 



Cov(Z„,Zfc) < Var(Z„) < 



L 



{n - dn) 



Therefore 



Var I '^Zkj = Card{Xj}Var(Z„) 

u+rn 

+ 2 E {rn + u- k + l)Covi Zu,Zk) 



k=u+l 



< 2- 



(r,T*)(r„+d„) 



(n - ci„)^ 



-K^.APr.,Pr 



CO 

E(lA{A^(r.)|^| (L^i^e^— '^^ Vl) (AVe"^— )'=}) 



A:=0 



As A^(n) < n and e'^'"'^'^" < n'''""^/''^(l"^l\ we can cut the separate between 
thek < In (n^+T™"/ 1=^(1^1)) / ln(AVe-'>'— ) and the > In (ni+T^^/i'^d^D ) /ln(AV 



), and we obtain that there exists a constant L := L{Lmix,'ymix, \ A\, A) 



such that 



L 



Var I E ^ ^'-Jn^^^''-*^^^-^^ 



fcex, 



Plugging this inequality and (30) in (29) gives 



EE4 



{n-dn? 



K^^(Pr^,Pr)x + 



, ln(2n) X 



n — dn 3 

< 2e-^ 
Using the basic inequality $2ab \le \epsilon a^2 + \epsilon^{-1} b^2$, we finally get
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The same control holds for 



(n - dn) 



, hence 



k=dn + ^ 



1 r(^.^*) 



V (In n) 



(n - dn) 



□ 



APPENDIX C: LINKS WITH CONTINUITY RATES 



C.l. Control of the bias with the continuity rates. An important 
tool in the theory of chains of infinite order is the continuity rates defined, 
V(al^, a, k) e A-^ xAxW, by 



efc(a_^a) := 



sup 

(6li-\cI^-i)6(A-«)2 



Pia\b-J^'a-_i)-Pia\c 



-oo "-fc; 



Let us remark that, for all (a_^,a, fc) G ^ ^ x A x N*, efc(a_^a) only 
depends on {aZ\,a), therefore, we will also use the following notation 

V(al;^,u;,a) G A'^ x A* x A, e(a;,a) := e|^^| (al;^a;, a) . 

These continuity rates can be used to upper bound the bias term of the risk. 
In order to see this, we introduce the following definition 

VtgT, ||e||^ := / d|u(a;) Ve|<^^|(a;a)^ . 
•^^-'^ aeA 

Proposition 13. Let t be a finite context tree, let r] G (0, e~^) and let 
= I w G A~^ : 3a G A, Pr < r? | . 

Then, 



(34) 



K^,{P,Pr) <^^ + \A\rjln(^^^fi{nr,] 
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Proposition (13) can be used under the following assumption 

(GC) > : V(w, a) G A'^ x A, P{a\u) > . 

In that case, ^1,, = for all r] < K^^, hence (34) yields 

K^,{P,Pr) < K,\\e\\l . 
If, on the other hand, /u(ilr;) > for all rj > 0, we can choose 

,-1/2 



V = \\4r (f^lklU 

and get an absolute constant C such that, for all r £ (0, 1), 

Proof. By definition, for all (a;, a) G A^^ x A, 

(35) \P{a\uj) - Pr{a\u))\ < e\^^\{uja) . 
In addition, we have 

(36) K,{P,Pr)= [ d^{u;)Pia\u;)ln(^^) 

Ja~^'IxA \Pr{a\uj)J 

Let r] G (0, e~i) and let 

n^ ,^ = I (uj, a) G X A : Pr{a\uj) > r/ 1 , 

^2/^ = { (w, a) ^ r^i^^ : PT-(a|a;) < P{a\uj) } , 
^^3,r; = { (o;, a) G T X ^ : PT-{a\u}) < r]} . 
From (36), we have

K,{P,Pr)={j +[ ]d^,{u;)Pia\u)ln(^^) 
\Jni.„ Jn-J \Pr[a\uj)J 



\Jni,^ Jn2,,J \Pr{a\uj)J 

\Pr{a\uj)J 
+ I dfi{uj)P{a\uj)ln( 

< I d^{uj)P{a\uj)hi ' ^ ' 



Hi,,, \Pr{a\uj) 



We use the bound 



We obtain 



(ci;,a)er23,^ 



\fx <r], xln I — ) < 77ln 



^ firHPr{a\u;)ln(-^—\<\A\riln(-^f,{nr,} . 

In addition, since Oi^.^ does not depend on the pasts before r, fj,{Qi^r]) 
fJ'T{^i,ri) and, using that Vx > 0, ln(2;) < x — 1, we obtain 

dKu^)P{a\u;)ln(p^) 
\PT[a\u;)J 

P{a\iLi) — PT-{a\u) 



dij,{uj) (P{a\uj) — Pr{a\uj) + Pr{a\uj)) 

V Pria\uj) 

<- I d/i(a;) (P(a|w) -P^(a|a;))^ + -/x^(Oi ^) 



< ^ / d/i(a;) Ve|^^|(a;a) 
^ JAN 



I l|2 



□ 
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C.2. Mixing properties and continuity rates. 0-mixing conditions 
can also be deduced from continuity. In order to see that, let us recall the 
following equivalent definition of 0- mixing coefficient (see [f f] prop 3.22) 

(Pik) = sup sup P{X^+' G E\aiX^_^)) - P{E) 
Let us introduce the following assumptions. 



(EC) 3(C,a) G (Ml) > : > 0, 1 - inf V inf Pia\ujJ) < Ce 



at 



(RC) 



3pmin > : V(a,w) e Ax A'^ , P{a\uj) > . 



From Theorem 4.1 and Corollary 4.1 in [16], under assumptions (EC) and 
(RC), there exists Q > 0, > such that 



sup sup 



Let then 11 el 



P{X',ll e E\a{Xl^)) - P{E)\ 



< 2Ci J]]e-"'('=+^) < - 



2a 



-otik 



j=0 



k,OD 



sup(aj,a)eA-Nxyi ^A:(w, a). It is clear that 



1 — inf > inf P(a\ujuj') < llel 



fc,oo 



Therefore, we have proved the following proposition. 

Proposition 14. Every stationary ergodic process satisfying (RC) and
such that $\|\epsilon\|_{k,\infty}$ decreases exponentially satisfies Assumption (GMC).

APPENDIX D: TECHNICAL TOOLS 
In the main proofs, we used the following lemmas. 

D.l. Decomposition of the risk. 

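Lemma 15. For all $\tau \in \mathcal{F}$,

$$\int d\mu(\omega a) \ln(P(a|\omega)) + \sum_{(\omega,a)\in\tau\times A} \hat\mu_n(\omega a) \ln\left(\frac{1}{\hat P_\tau(a|\omega)}\right) = K_\mu(P, P_\tau) - K_{\hat\mu_n}(\hat P_\tau, P_\tau) + L(P_\tau) ,$$

where

$$L(P_\tau) := \sum_{(\omega,a)\in\tau\times A} \left( \hat\mu_n(\omega a) - \mu(\omega a) \right) \ln\left(\frac{1}{P_\tau(a|\omega)}\right) .$$

Proof.

$$\int_{A^{-\mathbb{N}}\times A} d\mu(\omega a) \ln(P(a|\omega)) + \sum_{(\omega,a)\in\tau\times A} \hat\mu_n(\omega a) \ln\left(\frac{1}{\hat P_\tau(a|\omega)}\right)$$

$$= \int_{A^{-\mathbb{N}}\times A} d\mu(\omega a) \ln\left(\frac{P(a|\omega)}{P_\tau(a|\omega)}\right) + \sum_{(\omega,a)\in\tau\times A} \left( \hat\mu_n(\omega a) - \mu(\omega a) \right) \ln\left(\frac{1}{P_\tau(a|\omega)}\right) + \sum_{(\omega,a)\in\tau\times A} \hat\mu_n(\omega a) \ln\left(\frac{P_\tau(a|\omega)}{\hat P_\tau(a|\omega)}\right)$$

$$= K_\mu(P, P_\tau) + L(P_\tau) + \sum_{(\omega,a)\in\tau\times A} \hat\mu_n(\omega a) \ln\left(\frac{P_\tau(a|\omega)}{\hat P_\tau(a|\omega)}\right)$$

$$= K_\mu(P, P_\tau) + L(P_\tau) - K_{\hat\mu_n}(\hat P_\tau, P_\tau) .$$

□

D.2. Control of $L(P_\tau) - L(P_{\tau'})$.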



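Lemma 16. For all $(\tau, \tau') \in \mathcal{F}^2$, let $\mathcal{T}(\tau, \tau')$ be the unique tree satisfying
the following conditions.

1. $\tau \prec \mathcal{T}(\tau, \tau')$ and $\tau' \prec \mathcal{T}(\tau, \tau')$.

2. $\mathcal{T}(\tau, \tau') \subset \tau \cup \tau'$.

Then,

$$L(P_\tau) - L(P_{\tau'}) = \sum_{(\omega,a)\in\mathcal{T}(\tau,\tau')\times A} \left( \hat\mu_n(\omega a) - \mu(\omega a) \right) \ln\left(\frac{P_{\tau'}(a|\omega_{\tau'})}{P_\tau(a|\omega_\tau)}\right) .$$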

Proof. The result follows from the following remark. Let t ^ J- and let 
r be any element of such that r -< r. As /U and fi are probability measures, 
we have 



-^(-P-r) = Y (M^^a) - n{ua)) In ^ -^-^ 



OJ-r 



{ix),a)^TxA 

□ 
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Lemma 17. Let (r, r') G and /ei T{t, t') he the associated tree defined 
in Lemma 16. Let 

|/2n(^) - 



rj = max 

wer(r,T'),M('^)7^0 



L 



ir,T') 



max 



Pr{a\ujT-) V Pr'{a\uJT-') 
(w,a)er(r,r')xA, PT-Ta|'^r)AP^/(a|w^/)^0 1^ Pr{a\uJr) A P-r' {a\u)r') 

Then, for all e > 

L{Pr) - L{Pr>) <(e + 0(77)) {K^{P,Pr) + K^{P,Pr')) 



L 



+ (1 + 0{V))- K^^ {Pr, Pr) + K^^, {Pr',Pr 



Proof. From Lemma 16, we have 



(/In-l (^) - M(^)) Y1 ^nr,r') ( l^^^N ^ ) 



uj€T(t,t') 

(37) + ^ /I„_i(a;) J^(Pr(r,r')(a|w)-^(a|w))ln 
We have, for r* = r or r' 



Pr/(a|a;T-/ 



wer(T,T') aeA ^ TV I T ; 



< 



a\ijjj 

PT-{a\Ur*) 



ujeT{T,T') aeA 

(38) =7?i^^^^_,^(Pr(r,r'),^rO 

Hence, in (37), we have 



(/2n-i(a;) - ^(a;)) ^ Pr(r.ro(Q|^) 1" ( ^pTjl^'r^ ) 

<.der(r,r') aeA \ rl | rJ / 

/ P7-(t- ,-/■) (a|a;) 

(/i(a;) - /In-i(w)) > Pr(r,r')(a|^) In „ \ , ^ 



tjeT(T,r') aeA 
<r?(/C^(P,P,) + i^^(P,P,0) 
Moreover,

J2 /^n-i(w) ^{Pr{r,r'){a\^) " P(a|aj)) In T ^' 

(39) = /2„_iH^(Pr(r,r')(«l^)-^(«l^)) 

ajeT(r,T') aeA 

(40) /Pr(..oW\_^^/Pr(.,o(«M 



PT-(a|Wr) / V P-rK^I'^r') 

By Cauchy-Schwarz inequality, we have, for r* = r or r', 
(41) 



(Pr(.y)(a|^) - P(ak))^ y p f 1, f PnryM^) - - ' 



Since ^(r, r') C r U r', we have 

(Pr(r,x')(«l") - -P(<'l'^))' 



E 

{Pr{a\u:) - P{a\uj)Y ^ (Pr'(a|w) - P{a\uj)f 



-fV{r,r')(a|w) 



< 



From (48), (49) and (50) in the proof of Lemma 19, we have, for r* = r 



In addition, since 7~(t, r') C t U t', for r* = r or r', we have 

< 4-'' (P^,.,,,(«MAF..(aK.)) (l"(^fg|^))' ■ 



a£A 
From Lemma 23, we obtain

From (39) and (41), for all e > 0, we have therefore. 



+ (1 + 0{ri))— K^^{Pr, Pr) + K^^, {P,.,P^, 



□ 



D.3. Upper bounds on K^J^^, Kji. 



Lemma 18. Let {Xn)nez be a stationary ergodic process satisfying as- 
sumption (CC). Let 6 > and let J%["^((5) and be the sets defined 
in (6) and (7) respectively. Let Ttypiv) ^he set defined in (4) and let 
^good be the event (5). Then, on ^good, for all r G Ti"'\6), there exists 



V = O { a/ttt) V -X2y ) such that 



(42) K,,(P„p,,<(3+,)(^+ s:] ^ 1"(^) 

V V A„ / (uj,a)GTxA V V ; / 

(43) A,(P.,F,)<(3 + ,)(^+^] E in(^) 



Proof. It comes from Lemma 19 that, for some r] = O [ (ly V 



(1) V J"2) I ! 



(44) K^{Pr,Pr)< 

{'jlnioja) - lJL{oja)f + 2 (/i„_i(a;) - ^(w))' 



(1 + ^) E 



/i(w) 

{LO,a)&TxA,fi{uia)j^0 




By Proposition 5, C hence, /i(a;a) > An^ gnln{l/{'K{Loa)6)). 

Thus 



' Pnpi{toa) In ( ) + £i„ In 



TT{(jja)5 J \ ■K{oja)5 



We obtain in the same way that 



We deduce from assumption (CC) that 



Plugging this last inequality in (44) yields (42). (43) is obtained with the 
same inequality since, from Lemma 19 we have for r] = O [ fiy V — I , 



V V An ' A„ ^ 

^-^ (/I„(a;a) -/i(a;a))V2(/i„_i(a;) -/i(w))^ 

-^^^^ ^ W) ■ 

□ 

D.4. Consequences of typicality. 

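Lemma 19. Let $\eta < 1/3$ and let $\tau \subset \mathcal{T}_{typ}(\eta)$. We have

$$K_\mu(P_\tau,\hat P_\tau) \le \left(\frac{1}{2} + \frac{2\eta}{3(1-3\eta)}\right) \sum_{(\omega,a)\in\tau\times A,\ \mu(\omega a)\neq 0} \mu(\omega a)\left(\frac{\hat P_\tau(a|\omega) - P_\tau(a|\omega)}{P_\tau(a|\omega)}\right)^2. \tag{45}$$

Moreover, if we denote by

$$K_{\hat\mu_n}(\hat P_\tau,P_\tau) := \sum_{(\omega,a)\in\tau\times A} \hat\mu_n(\omega a)\ln\left(\frac{\hat P_\tau(a|\omega)}{P_\tau(a|\omega)}\right),$$

we have

$$K_{\hat\mu_n}(\hat P_\tau,P_\tau) \le \left(\frac{1}{2} + \frac{2\eta}{3(1-3\eta)}\right) \sum_{(\omega,a)\in\tau\times A,\ \hat\mu_n(\omega a)\neq 0} \hat\mu_n(\omega a)\left(\frac{\hat P_\tau(a|\omega) - P_\tau(a|\omega)}{\hat P_\tau(a|\omega)}\right)^2. \tag{46}$$

In addition, if $\eta \to 0$,

$$K_\mu(P_\tau,\hat P_\tau) - K_{\hat\mu_n}(\hat P_\tau,P_\tau) = O(\eta)\,K_\mu(P_\tau,\hat P_\tau). \tag{47}$$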



Finally, for all $(\omega,a)\in\tau\times A$ such that $\mu(\omega a)\neq 0$,



,Pr{a\^) - P{^\^) ) ^ ^ JMu^a) - Ku^a))' ^ ^ (/In-i(^) - M^))^ 
Pr(a|w) ^J'{^) /^('^) 

For all $(\omega,a)\in\tau\times A$ such that $\hat\mu_n(\omega a)\neq 0$,



P.(aM-P(aMj ^ J^,(^a)-^(^«))2 (^„_,(^) - ^(^))2 
l^ny-^) — : — ^ ^ ^ 7 — ^ r^" 



Pr{a\u:) 



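Proof. Let us first remark that, for all $\omega$ in $\tau$ such that $\mu(\omega a)\neq 0$, we have

$$\frac{|\hat P_\tau(a|\omega) - P_\tau(a|\omega)|}{P_\tau(a|\omega)} \le \frac{2\eta}{1-\eta} < 1, \qquad \frac{|\hat P_\tau(a|\omega) - P_\tau(a|\omega)|}{\hat P_\tau(a|\omega)} \le \frac{2\eta}{1-\eta} < 1. \tag{48}$$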



In addition, for all $\eta' < 1$ and all $|u| \le \eta'$, we have

$$\left|-\ln(1-u) - u - \frac{u^2}{2}\right| \le u^2\,\frac{\eta'}{3(1-\eta')}. \tag{49}$$
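As a quick sanity check of (49), the following sketch evaluates both sides of $|-\ln(1-u) - u - u^2/2| \le u^2\,\eta'/(3(1-\eta'))$ on a grid of values $|u| \le \eta'$. The function names and the grid size are illustrative choices and do not come from the paper. The sketch also checks that, with $\eta' = 2\eta/(1-\eta)$ as in (48), the constant $\eta'/(3(1-\eta'))$ equals $2\eta/(3(1-3\eta))$.

```python
import math


def taylor_remainder(u: float) -> float:
    """Return |-ln(1-u) - u - u^2/2| for u < 1."""
    return abs(-math.log1p(-u) - u - 0.5 * u * u)


def bound(u: float, eta_prime: float) -> float:
    """Right-hand side of the inequality: u^2 * eta' / (3 (1 - eta'))."""
    return u * u * eta_prime / (3.0 * (1.0 - eta_prime))


def check(eta_prime: float, steps: int = 2000) -> bool:
    """Check the inequality on a regular grid of u in [-eta', eta']."""
    for i in range(steps + 1):
        u = -eta_prime + 2.0 * eta_prime * i / steps
        # Small tolerance for floating-point rounding near u = 0.
        if taylor_remainder(u) > bound(u, eta_prime) + 1e-15:
            return False
    return True


def constant_from_eta(eta: float) -> float:
    """Constant eta'/(3(1-eta')) obtained with eta' = 2 eta / (1 - eta)."""
    eta_prime = 2.0 * eta / (1.0 - eta)
    return eta_prime / (3.0 * (1.0 - eta_prime))


if __name__ == "__main__":
    for eta_prime in (0.1, 0.5, 0.9):
        print(eta_prime, check(eta_prime))
    print(constant_from_eta(0.1), 2 * 0.1 / (3 * (1 - 3 * 0.1)))
```

The restriction $|u| \le \eta'$ matters: for a value of $u$ well outside $[-\eta', \eta']$ the left-hand side can exceed the right-hand side.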



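(45), (46) and (47) follow by plugging (48) and (49) into (50) and (51), where, for $u = \dfrac{P_\tau(a|\omega) - \hat P_\tau(a|\omega)}{P_\tau(a|\omega)}$,

$$K_\mu(P_\tau,\hat P_\tau) - \frac{1}{2}\sum_{(\omega,a)\in\tau\times A,\ \mu(\omega a)\neq 0} \mu(\omega a)\left(\frac{P_\tau(a|\omega) - \hat P_\tau(a|\omega)}{P_\tau(a|\omega)}\right)^2 = \sum_{(\omega,a)\in\tau\times A,\ \mu(\omega a)\neq 0} \mu(\omega a)\left(-\ln(1-u) - u - \frac{u^2}{2}\right). \tag{50}$$

And, for $u = \dfrac{\hat P_\tau(a|\omega) - P_\tau(a|\omega)}{\hat P_\tau(a|\omega)}$,

$$K_{\hat\mu_n}(\hat P_\tau,P_\tau) - \frac{1}{2}\sum_{(\omega,a)\in\tau\times A,\ \hat\mu_n(\omega a)\neq 0} \hat\mu_n(\omega a)\left(\frac{P_\tau(a|\omega) - \hat P_\tau(a|\omega)}{\hat P_\tau(a|\omega)}\right)^2 = \sum_{(\omega,a)\in\tau\times A,\ \hat\mu_n(\omega a)\neq 0} \hat\mu_n(\omega a)\left(-\ln(1-u) - u - \frac{u^2}{2}\right). \tag{51}$$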

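The last two bounds follow from the inequalities

$$|\hat P_\tau(a|\omega) - P_\tau(a|\omega)| \le \frac{|\hat\mu_n(\omega a) - \mu(\omega a)|}{\mu(\omega)} + \hat P_\tau(a|\omega)\,\frac{|\mu(\omega) - \hat\mu_{n-1}(\omega)|}{\mu(\omega)},$$

$$|\hat P_\tau(a|\omega) - P_\tau(a|\omega)| \le \frac{|\hat\mu_n(\omega a) - \mu(\omega a)|}{\hat\mu_{n-1}(\omega)} + P_\tau(a|\omega)\,\frac{|\mu(\omega) - \hat\mu_{n-1}(\omega)|}{\hat\mu_{n-1}(\omega)}.$$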
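The two inequalities above can be checked by a direct computation. The sketch below assumes, as the displayed bounds suggest, that $P_\tau(a|\omega) = \mu(\omega a)/\mu(\omega)$ and $\hat P_\tau(a|\omega) = \hat\mu_n(\omega a)/\hat\mu_{n-1}(\omega)$; these definitions are not restated in this section, so they should be read as assumptions.

```latex
\begin{align*}
\hat P_\tau(a|\omega) - P_\tau(a|\omega)
 &= \frac{\hat\mu_n(\omega a)}{\hat\mu_{n-1}(\omega)} - \frac{\mu(\omega a)}{\mu(\omega)} \\
 &= \frac{\hat\mu_n(\omega a) - \mu(\omega a)}{\mu(\omega)}
    + \hat\mu_n(\omega a)\left(\frac{1}{\hat\mu_{n-1}(\omega)} - \frac{1}{\mu(\omega)}\right) \\
 &= \frac{\hat\mu_n(\omega a) - \mu(\omega a)}{\mu(\omega)}
    + \hat P_\tau(a|\omega)\,\frac{\mu(\omega) - \hat\mu_{n-1}(\omega)}{\mu(\omega)} .
\end{align*}
```

Taking absolute values and applying the triangle inequality gives the first bound. The second is obtained in the same way, adding and subtracting $\mu(\omega a)/\hat\mu_{n-1}(\omega)$ instead of $\hat\mu_n(\omega a)/\mu(\omega)$.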
These imply in particular, since $\eta < 1/3$,
Pr{a\uj) — P{a\uj 



^ \fi{uja) -]2n{uja)\ ^ 



< 



Y^/i(a;)/U„_i(a;j 

(^P{a\uj) V P{a\uj 
\fi{uja) - /x„(a;a)| 



l^(^) - 'fin-iiu})\ 
i/^(a;)/i„_i(w) 



a\uj 



Y^/i(a;)/I„„i(tj) 



□ 



D.5. Tools for mixing processes. 



Lemma 20. Let $(X_n)_{n\in\mathbb{Z}}$ be a $\phi$-mixing process satisfying (MC) and
(ND). Then, for all $\omega \in A^*$,



(52) 5^ Gov ( 1^0 -^Ax^^ , 

^ — * I \ — |a;| + l k—\uj\ + l 



k=0 



< ^ + 



1 



As a consequence, for all $N \in \mathbb{N}^*$,



\ k=0 / 



2N { + 



1 



A- 1 



1 - A 



If, in addition, for any (a, w) G Ax A* , P{Xq = a\X[ = w) < A, then, for 
all {uj,uj') G (A*)^, 



Cov( 1^0 ,,lxk 

— \uj' 1+1 k—\u\-\-l 



< Y</'fc-|a;|+llfc-|a;|+l>0 + ^k~\w\+l<oj • 

Proof. If $k - |\omega| + 1 > 0$, we use Lemma 22 and we get



Cov( l^o ,,1^^ 

— \ui'\ + l k-\uj\ + l 



,) < ^J(|)k~\w\+l^J'{^)K^') ■ 



Therefore, 

oo 

(53) Yl |Cov(lxO| 

fc=|<^|-i 



'l+i=-"^^t|.|+i= 



k=0 
If $k - |\omega| + 1 \le 0$, denoting by $\omega =$ condition (ND) implies



Gov 



1=1 



This is sufficient to obtain (52), choosing $\omega' = \omega$. As a consequence,

\fc=0 / k,k'=0 ^ I IT / 

N 

< 2^{N -k + 1 

oo 

<2iV^|Cov(l^o , 

^ — ' I \ -k^ +1 fc- +1 



Cov( 1^0 -u^,'i-X^ , ,^,= 



k=0 



<2N { ^> + 



A- 1 



An argument symmetric to the one in (53) shows that, if A; — |a;| + 1 < 0, 
when we have moreover, for any {a,uj) £ A x A*, P{Xo = a|x|'^' = w) < A, 



Gov \ lyO _, ,/ , 1 yfc 



Thus, 
Gov ( 



< AV(w) a ii{uj') < . 



□ 



The following lemmas are due to Viennet [42].



Lemma 21. Let $(X_n)_{n\in\mathbb{Z}}$ be a $\beta$-mixing process. Let $(J_k)_{k=1,\ldots,N}$ be a
collection of subsets of $\mathbb{N}$ satisfying the following conditions.

1. $\exists\, q_0 \in \mathbb{N}^*$ such that, for all $k = 1,\ldots,N-1$, $\max\{i \in J_k\} \le q_0 + \min\{j \in J_{k+1}\}$.

2. $\exists\, M_0 \in \mathbb{N}^*$ such that, for all $k = 1,\ldots,N$, $\mathrm{Card}\{J_k\} \le M_0$.

Then, there exist random variables $(Y_i)_{i\in\cup_{k=1}^N J_k}$ such that

1. for all $k = 1,\ldots,N$, $(Y_i)_{i\in J_k}$ has the same distribution as $(X_i)_{i\in J_k}$,

2. for all $k = 2,\ldots,N$, $(Y_i)_{i\in J_k}$ is independent of $(X_i,Y_i)_{i\in\cup_{l\le k-1}J_l}$,

3. for all $k = 1,\ldots,N$, $\mathbb{P}\{(X_i)_{i\in J_k} \neq (Y_i)_{i\in J_k}\} \le M_0\,\beta_{q_0}$.

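Lemma 22. Let $X$, $Y$ be two real valued random variables. There exist
two real functions $b_1$ and $b_2$ such that, for all bounded functions $f$ and $g$,

$$\|b_1\|_\infty \le \phi(\sigma(X),\sigma(Y)), \qquad \|b_2\|_\infty \le \phi(\sigma(Y),\sigma(X)), \tag{54}$$

$$\mathrm{Cov}(f(X),g(Y)) \le \sqrt{\mathbb{E}\big(b_1(X)f^2(X)\big)\,\mathbb{E}\big(b_2(Y)g^2(Y)\big)}. \tag{55}$$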

D.6. Additional lemmas. The following lemma can be found, for example, in [34], Lemma 7.24.

Lemma 23. For all probability measures $P$, $Q$ with $P \ll Q$,

Lemma 24. Let $\mu$ be a probability measure with kernel $P$ and let $\tau$
be a finite tree. Then, for all $\nu \in \mathcal{M}_\tau$ with transition kernel $Q$ such that
$K_\mu(P,Q) < \infty$, we have

$$K_\mu(P,Q) = K_\mu(P,P_\tau) + K_\mu(P_\tau,Q_\tau).$$
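The decomposition of Lemma 24 can be illustrated numerically on a toy case. The sketch below is only an illustration under simplifying assumptions that are not in the paper: a binary alphabet, a measure $\mu$ on triples $(\omega_1,\omega_2,a)$ standing for a memory of length two, and a tree $\tau$ whose contexts are the single last symbol $\omega_2$. All function and variable names are hypothetical.

```python
import itertools
import math
import random

ALPHABET = (0, 1)


def random_measure(rng):
    """Random strictly positive probability on triples (w1, w2, a)."""
    weights = {t: rng.uniform(0.1, 1.0)
               for t in itertools.product(ALPHABET, repeat=3)}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def marginals(mu):
    """Marginals mu(w1 w2), mu(w2 a) and mu(w2) of the toy measure."""
    mu_ctx, mu_w2a, mu_w2 = {}, {}, {}
    for (w1, w2, a), m in mu.items():
        mu_ctx[(w1, w2)] = mu_ctx.get((w1, w2), 0.0) + m
        mu_w2a[(w2, a)] = mu_w2a.get((w2, a), 0.0) + m
        mu_w2[w2] = mu_w2.get(w2, 0.0) + m
    return mu_ctx, mu_w2a, mu_w2


def random_tree_kernel(rng):
    """Random kernel Q(a | w2) that depends on the last symbol only."""
    q = {}
    for w2 in ALPHABET:
        p0 = rng.uniform(0.1, 0.9)
        q[(w2, 0)] = p0
        q[(w2, 1)] = 1.0 - p0
    return q


def decomposition(mu, q):
    """Return K_mu(P, Q), K_mu(P, P_tau) and K_mu(P_tau, Q_tau)."""
    mu_ctx, mu_w2a, mu_w2 = marginals(mu)
    k_pq = k_ppt = 0.0
    for (w1, w2, a), m in mu.items():
        p = m / mu_ctx[(w1, w2)]             # P(a | w1 w2)
        p_tau = mu_w2a[(w2, a)] / mu_w2[w2]  # P_tau(a | w2)
        k_pq += m * math.log(p / q[(w2, a)])
        k_ppt += m * math.log(p / p_tau)
    # Third term, computed on the tree only, from the marginal mu(w2 a).
    k_ptq = sum(m * math.log((m / mu_w2[w2]) / q[(w2, a)])
                for (w2, a), m in mu_w2a.items())
    return k_pq, k_ppt, k_ptq


rng = random.Random(0)
mu = random_measure(rng)
q = random_tree_kernel(rng)
k_pq, k_ppt, k_ptq = decomposition(mu, q)
print(k_pq, k_ppt + k_ptq)
```

Since both terms on the right-hand side are nonnegative, the identity also shows that, in this toy setting, $P_\tau$ minimizes $K_\mu(P,\cdot)$ over kernels that depend on the context in $\tau$ only.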

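Proof. By definition,

$$\begin{aligned}
K_\mu(P,Q) &= \int_{A^{-\mathbb{N}}\times A} d\mu(\omega a)\,\ln\left(\frac{P(a|\omega)}{Q(a|\omega)}\right) \\
&= \int_{A^{-\mathbb{N}}\times A} d\mu(\omega a)\,\ln\left(\frac{P(a|\omega)}{P_\tau(a|\omega)}\right) + \int_{A^{-\mathbb{N}}\times A} d\mu(\omega a)\,\ln\left(\frac{P_\tau(a|\omega)}{Q(a|\omega)}\right) \\
&= K_\mu(P,P_\tau) + \int_{A^{-\mathbb{N}}\times A} d\mu(\omega a)\,\ln\left(\frac{P_\tau(a|\omega)}{Q(a|\omega)}\right).
\end{aligned}$$

For all $\omega \in A^{-\mathbb{N}}$, let $\omega_l$ be such that $\omega = \omega_l\omega_\tau$. As the function
$\ln\left(\frac{P_\tau(a|\omega)}{Q(a|\omega)}\right)$ does not depend on $\omega_l$, is equal to $\ln\left(\frac{P_\tau(a|\omega_\tau)}{Q_\tau(a|\omega_\tau)}\right)$, and $\mu$ satisfies,
for all $(\omega,a) \in \tau\times A$, $\int_{\omega_l\in A^{-\mathbb{N}}} d\mu(\omega_l\omega a) = \mu(\omega a) = \mu_\tau(\omega a)$, we have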







Lemma 25. (Bennett's inequality) Let $Y_1,\ldots,Y_N$ be independent random variables
such that, for all $i = 1,\ldots,N$, $\|Y_i\|_\infty \le b$. Then, for all $y > 0$,



N 



^(y.-E(y,)> 



1=1 



N 



^2^Var(y)2/+^ \ < e'y 